On Apr 7, 2010, at 10:50 PM, James Seigel wrote:

> I am new to this group, and relatively new to hadoop.

Welcome to the community, James. :)

> I am looking at building a large cluster. I was wondering if anyone has any
> best practices for a cluster in the hundreds of nodes?

Take a look at the 'Hadoop 24/7' presentation (on the Hadoop wiki presentations page) I did for ApacheCon EU last year. It covers a lot of the "now that I have a grid, what do I do?" situations.

> As well, has anyone had experience with a cluster spanning multiple data
> centers. Is this a bad practice? moderately bad practice? insane?

Right now, it generally falls into the insane category unless you have REALLY REALLY REALLY low latency and high bandwidth. The heartbeats between nodes, issues with block placement, etc., make it highly likely that you will saturate the link and/or split the cluster into multiple pieces.

> Is it better to build the 1000 node cluster in a single data center? Do you
> back one of these things up to a second data center or a different 1000 node
> cluster?

We're currently going with a 'multiple grids in one data center' strategy. Our 'Source of Truth' data is from another source, meaning we could (theoretically) rebuild the grid from that source if we were to get decimated by dinosaurs. [That source of truth has a much better backup/DR strategy.]

> Sorry, I am asking crazy questions...I am just wanting to learn the meta
> issues and opportunities with making clusters.

These are pretty normal questions. We should probably create a FAQ or something on the wiki.
