On 10/02/11 22:25, Michael Segel wrote:

Shrinivas,

Assuming you're in the US, I'd recommend the following:

Go with 2TB 7200 SATA hard drives.
(Not sure what type of hardware you have)

What we've found is that in the data nodes, there's an optimal configuration 
that balances price versus performance.

While your chassis may hold 8 drives, how many open SATA ports are on the 
motherboard? Since you're using JBOD, you don't want the additional expense of 
having to purchase a separate controller card for the additional drives.


I'm not going to disagree about cost, but I will note that a single controller can become a bottleneck once you add a lot of disks to it; it generates lots of interrupts that all go to the same core, which then ends up at 100% CPU and overloaded. With two controllers the work can be spread over two CPUs, moving the bottleneck back into the IO channels.

For that reason I'd limit the number of disks on a single controller to around 4-6.
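
If you want to check whether one controller's interrupts really are landing on a single core, something like the sketch below can help. It assumes a Linux box, takes text in the /proc/interrupts layout (a header row of CPU names, then one row per IRQ with per-CPU counts followed by a description), and sums the counts for rows whose description mentions a driver name. The function name, the sample text and the "ahci" driver name are illustrative assumptions, not taken from any particular machine.

```python
def per_cpu_interrupts(text, device):
    """Sum interrupt counts per CPU for rows whose description mentions `device`.

    `text` is expected to look like /proc/interrupts on Linux.
    """
    lines = text.strip().splitlines()
    cpus = lines[0].split()              # e.g. ["CPU0", "CPU1"]
    totals = dict.fromkeys(cpus, 0)
    for line in lines[1:]:
        fields = line.split()
        # fields[0] is the IRQ label, then one count per CPU, then the description
        description = " ".join(fields[len(cpus) + 1:])
        if device not in description:
            continue
        for cpu, count in zip(cpus, fields[1:1 + len(cpus)]):
            totals[cpu] += int(count)
    return totals


# Made-up sample in the /proc/interrupts layout (not real measurements)
SAMPLE = """
           CPU0       CPU1
 16:     900000        120   IO-APIC   16-fasteoi   ahci
 17:         10     500000   PCI-MSI   eth0
"""

if __name__ == "__main__":
    print(per_cpu_interrupts(SAMPLE, "ahci"))
```

On a live node you would pass in the contents of /proc/interrupts instead of SAMPLE. A heavily lopsided result for the disk controller is the pattern described above.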

Remember that as well as storage capacity, you need disk space for logs, spill space, temp dirs, etc. This is why 2TB HDDs are looking appealing these days.
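
As a rough back-of-envelope for that, the sketch below estimates how much unique data a node can hold once some of the raw disk is set aside for logs, spill and temp space. The 25% reserve is purely an assumed figure for illustration, the function and parameter names are made up, and replication of 3 is the usual HDFS default; plug in your own numbers.

```python
def usable_hdfs_tb(drives, drive_tb=2.0, reserved_fraction=0.25, replication=3):
    """Estimate unique data (in TB) one node can hold in HDFS.

    reserved_fraction is an assumed share of raw disk kept back for
    logs, spill space and temp dirs; it is not a measured value.
    """
    raw_tb = drives * drive_tb
    hdfs_tb = raw_tb * (1.0 - reserved_fraction)   # what is left for HDFS blocks
    return hdfs_tb / replication                   # each block is stored `replication` times


if __name__ == "__main__":
    # 8 x 2TB drives, 25% held back, 3x replication
    print(usable_hdfs_tb(8))
```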

Speed? 10K RPM drives have a faster seek time and possibly more bandwidth, but you pay in capital and power. If the HDFS blocks are laid out well, seek time isn't so important, so consider saving the money and putting it elsewhere.

The other big question with Hadoop is RAM and CPU, and the answer there is "it depends". RAM depends on the algorithm, as can the CPU:spindle ratio ... I recommend 1 core to 1 spindle as a good starting point. In a large cluster, the extra capital cost of a second CPU, compared to the extra servers and storage you could get for the same money, speaks in favour of more servers, but in smaller clusters the spreadsheets say different things.
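
To turn the two rules of thumb in this mail (around 4-6 disks per controller, 1 core per spindle) into numbers for a given chassis, a minimal sketch might look like this. The function name and defaults are only illustrative starting points, and the real answer still depends on the workload.

```python
import math


def node_plan(drives, max_disks_per_controller=6, cores_per_spindle=1.0):
    """Return (controllers, cores) for a node with `drives` data disks.

    Defaults follow the rules of thumb above; treat them as starting
    points rather than fixed requirements.
    """
    controllers = math.ceil(drives / max_disks_per_controller)
    cores = math.ceil(drives * cores_per_spindle)
    return controllers, cores


if __name__ == "__main__":
    # An 8-drive chassis at the default limits
    print(node_plan(8))
```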

-Steve

(disclaimer, I work for a server vendor :)
