If your "commodity" pc's don't have a whole lot of storage space, then you would have to run your HDFS datanodes elsewhere. In that case, a lot of data traffic will occur (e.g. sending data from datanodes to where data processing occurs), meaning map reduce performance will be slowed down. It's always good to have the actual data on the same machine where the processing will occur, or there will be extra network i/o involved.
If you decide to host datanodes on the PCs, then you also have to be able to protect the data (e.g., make sure people don't accidentally delete data blocks). Well, there are lots and lots of possibilities, and I would like to hear how your plan goes, too!

On Tue, Sep 29, 2009 at 12:45 PM, James Carroll <[email protected]> wrote:
> I work in a call center, which means we have a lot of PCs sitting on
> agents' desks doing a whole lot of nothing in the middle of the night. It
> also means that we collect a lot of phone and other data that all
> gets rolled up into reports and/or tables that drive reports or other
> processes. We're pushing the limits of what our current data processing
> can do, and I'd like to pitch Hadoop/HDFS/Pig to my boss. So, bottom line,
> before I go too much further: can we create a Hadoop cluster across all
> those desktop PCs, start/wake it up once everyone has gone home, load
> the data, do the analysis, and then creep back into the shadows before
> anyone is the wiser? Or would the slave nodes have to be 'dedicated',
> such that they wouldn't be able to do anything other than that? We'll figure
> out the capacity aspects later, if I can get a proof of concept approved
> to at least try. The PCs are, you guessed it, Windows machines.
>
> Thanks!
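On protecting data from accidental deletion: HDFS has an optional trash feature, where deleted files are moved to a per-user .Trash directory instead of being removed immediately. A sketch of the core-site.xml fragment that enables it (the interval shown is an arbitrary example value):

```xml
<!-- core-site.xml: keep deleted files in a per-user .Trash directory -->
<property>
  <name>fs.trash.interval</name>
  <!-- minutes before trash is permanently emptied; 1440 = 24 hours -->
  <value>1440</value>
</property>
```

This only guards against `hadoop fs -rm` mistakes from the shell, not against someone wiping the datanode's local disk, so you would still want to keep the block storage directories out of ordinary users' reach.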

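Since the idea is to bring the cluster up only after hours, a minimal sketch of how that could be scheduled from the master node, assuming a Unix-style master with Hadoop installed under /opt/hadoop (both hypothetical; the Windows desktops themselves would need Cygwin or similar to run the Hadoop scripts):

```shell
# Hypothetical crontab entries on the master node (paths are assumptions).
# start-all.sh / stop-all.sh are the standard Hadoop control scripts.
0 22 * * * /opt/hadoop/bin/start-all.sh   # bring the cluster up at 22:00
0 5  * * * /opt/hadoop/bin/stop-all.sh    # shut it down before 05:00
```

Keep in mind that stopping the cluster every morning means any job still running at 05:00 is killed, so the nightly window has to be sized to the longest job.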