Can someone explain in a little more detail the reasons behind the hardware specs that were recently added to the wiki for the NameNode? I'm interested in learning how others settled on these specs. Are they based on observed behavior, or just recommended by other Hadoop users?

- Use a good server with lots (15GB+) of RAM.
     - why 15+ GB? Do we allocate all of the memory to the NameNode, or just allocate some amount using -Xmx and leave the rest free so the machine doesn't start swapping?

- Consider using fast RAID5 storage for keeping the index.
     - why RAID5?

- List more than one name node directory in the configuration, so that multiple copies of the indices will be stored. As long as the directories are on separate disks, a single full disk will not corrupt the index.
     - If running RAID 5, why is this necessary?

- Configure the name node to store one set of transaction logs on a separate disk from the index.
     - why?

- Configure the name node to store another set of transaction logs to a network mounted disk.
     - why?

- Do not host DataNode, JobTracker or TaskTracker services on the same system.
     - how much memory would the JobTracker need? Does it use a lot of CPU? In general, what are good specs for a JobTracker machine, and can the machine be shared with other services?
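
To check that I am reading the three directory recommendations correctly, here is a rough sketch of how I would expect them to look in the site configuration. Everything in it is an assumption on my part: the mount points are made up, and I am assuming dfs.name.dir and dfs.name.edits.dir are the relevant properties (the names may differ between Hadoop versions), each taking a comma-separated list of directories.

```xml
<!-- Hypothetical fragment: all paths are placeholders, property names are assumed -->
<property>
  <name>dfs.name.dir</name>
  <!-- two copies of the index, each on a separate local disk -->
  <value>/disk1/hdfs/name,/disk2/hdfs/name</value>
</property>
<property>
  <name>dfs.name.edits.dir</name>
  <!-- transaction logs: one copy on its own local disk, one on a network mount -->
  <value>/disk3/hdfs/edits,/mnt/nfs/hdfs/edits</value>
</property>
```

If that is roughly right, what I would like to understand is which failure each extra copy is meant to protect against.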

Thanks so much for the help. I think it would be hugely helpful for the community to start describing their Hadoop cluster setups in more detail than just the DataNode config and cluster size. We all want to be confident that we are spending money on the right machines to grow our clusters the right way.


Most appreciated,

- Manish
Co-Founder Rapleaf.com

