Thanks for your comments!
Our research lab is mostly focused on NLP and IR, so we are aiming for
good throughput and a reasonable storage capacity of roughly 3-4 TB.
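
To make the throughput point concrete, here is a rough back-of-the-envelope
sketch comparing one 500 GB drive per node with two smaller drives per node.
The per-drive throughput figure and the 250 GB drive size are assumptions
chosen purely for illustration, and the model assumes aggregate disk speed
scales linearly with the number of drives, as suggested below.

```python
from dataclasses import dataclass
from math import ceil

# Assumed sequential throughput of a single drive, in MB/s. Illustrative only.
ASSUMED_MB_PER_SEC_PER_DRIVE = 60.0


@dataclass(frozen=True)
class NodeConfig:
    drives_per_node: int
    gb_per_drive: int

    @property
    def gb_per_node(self) -> int:
        return self.drives_per_node * self.gb_per_drive


def nodes_needed(config: NodeConfig, target_gb: int) -> int:
    """Smallest node count whose raw capacity reaches target_gb."""
    return ceil(target_gb / config.gb_per_node)


def aggregate_mb_per_sec(config: NodeConfig, nodes: int) -> float:
    """Aggregate disk throughput, assuming linear scaling with drive count."""
    return nodes * config.drives_per_node * ASSUMED_MB_PER_SEC_PER_DRIVE


TARGET_GB = 5000  # the 5 TB figure from the original question

one_big = NodeConfig(drives_per_node=1, gb_per_drive=500)
two_small = NodeConfig(drives_per_node=2, gb_per_drive=250)  # hypothetical size

for name, cfg in (("1 x 500 GB", one_big), ("2 x 250 GB", two_small)):
    n = nodes_needed(cfg, TARGET_GB)
    print(f"{name}: {n} nodes, {aggregate_mb_per_sec(cfg, n):.0f} MB/s aggregate")
```

Under these assumptions both layouts need the same number of nodes, but the
two-drive layout doubles the aggregate disk throughput. The capacities are
raw figures; HDFS replication would reduce the usable space.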

--
Sérgio Nunes

On Wed, Mar 5, 2008 at 6:51 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
>  The right answer really depends on your workload and what your needs and
>  goals are.
>
>  You say that this is a research lab.  If you are researching parallel
>  algorithms, then I would recommend much higher parallelism.
>
>  If you are working on problems where you want throughput, then the answer
>  may be a bit different.  In that case, the two major considerations are
>  aggregate disk speed (proportional to number of drives/interfaces) and
>  aggregate CPU speed.  Much of my workload is disk limited so I find having
>  more machines each with a disk to saturate is a good idea.  Having
>  completely anemic CPUs is not very helpful, however.
>
>  Assuming that you are only concerned with purchase cost, I would tend to
>  recommend single CPU, dual core machines with a decently fast 64-bit CPU
>  (Opteron or Xeon), each with 500GB drives.  Depending on your luck, you may
>  be able to get dual CPUs for 4 cores per box for a similar price.  Getting
>  two slightly smaller disks would probably give you better throughput for
>  very slightly higher cost.
>
>  If you are considering life-cycle costs then you may come up with slightly
>  different configurations due to rack density.  Blades don't generally have
>  very large disks so to get to 5TB, you may require a lot of blades.
>
>  Homogeneity is not a huge issue.  I have a cluster with 4 pretty hot Xeon
>  cores on some boxes and 2 lousy cores on other boxes and the long tail
>  phenomenon does not come up all that much because the file splits are fine
>  enough that it all works out in the end.
>
>  If this is for learning about parallel processing and if somebody else is
>  paying your power bills you should consider getting used machines or
>  machines from a place like Dell outlet.  Many of the machines that you get
>  that way are considerably lower cost and would provide comparable
>  disk/network bandwidth to new machines.  eBay has bunches of Dell 1850s for
>  sale, for instance.
>
>
>
>  On 3/5/08 7:16 AM, "S. Nunes" <[EMAIL PROTECTED]> wrote:
>
>  > Hi,
>  >
>  > I'm trying to deploy a small Hadoop cluster for our research lab.
>  > We are in the process of selecting the hardware for this cluster. We
>  > are aiming at a 12 CPU, 5 TB cluster. This is obviously a very rough
>  > estimation.
>  >
>  > I have a few questions and I would greatly appreciate your feedback.
>  >
>  > Which is better, a cluster based on many low-performance nodes, or a
>  > cluster with fewer but high-performance nodes? For instance, should I
>  > bet on a cluster with 4 nodes (1 CPU + 100 GB each) or on a cluster
>  > with 2 nodes (2 CPU + 200 GB each)?
>  >
>  > What should be considered regarding node homogeneity? I understand
>  > that a very unbalanced cluster would result in a "long tailed"
>  > performance - slower nodes would penalize the overall performance.
>  > However, how critical is that? Do you have performance numbers to
>  > support our decision?
>  >
>  > Finally, do you recommend any specific hardware configuration for
>  > starting a cluster (rack, blade, tower...) ?
>  >
>  > Thanks in advance for your comments,
>  >
>  > --
>  > Sérgio Nunes
>
>
