Dennis Kubes wrote:
Ok. This is a little different in that I need to start thinking about my algorithms in terms of sequential passes and multiple jobs instead of direct access. That way I can use the input directories to get the data that I need. Couldn't I also do it through the MapRunnable interface, with a reader shared by an inner mapper class, or is that hacking the interfaces when I should be thinking about this in terms of sequential processing?
You can do it however you like! I don't know enough about your problem to say definitively which is the best approach. We're working hard on Hadoop so that we can scalably stream data through MapReduce at megabytes/second per node. So you might do some back-of-the-envelope calculations. Figure at least 10ms per random access. So your maximum random access rate might be around 100/second per drive. Figure a 10MB/second transfer rate, so if randomly accessed items are 100kB each, each access costs roughly 10ms of seek plus 10ms of transfer, and your maximum random access rate drops to 50 items/drive/second. Since these are over the network, real performance will probably be much worse. Also, MapFile requires a scan per entry, so you might really end up scanning 1MB per access, which would slow random accesses to roughly 10 items/drive/second. You might benchmark your random access performance to get a better estimate, then compare that to processing the whole collection through MapReduce.
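For what it's worth, the arithmetic above can be written down as a tiny cost model. This is just a sketch of the estimates in the paragraph (10ms seek, 10MB/s transfer); the class and method names are made up for illustration, not anything in Hadoop:

```java
// Back-of-the-envelope model of random-access throughput on one drive.
// Assumptions, taken from the discussion above: ~10 ms per seek,
// ~10 MB/s sequential transfer (i.e. 10,000 bytes per millisecond).
public class AccessEstimate {
    static final double SEEK_MS = 10.0;
    static final double BYTES_PER_MS = 10_000.0; // 10 MB/s

    // Items per second one drive can serve if each access pays one
    // seek and then reads bytesPerItem sequentially.
    static double itemsPerSecond(double bytesPerItem) {
        double msPerItem = SEEK_MS + bytesPerItem / BYTES_PER_MS;
        return 1000.0 / msPerItem;
    }

    public static void main(String[] args) {
        // Seek-only: 100 items/drive/second.
        System.out.println(itemsPerSecond(0));
        // 100 kB per item: 10 ms seek + 10 ms transfer = 50 items/drive/second.
        System.out.println(itemsPerSecond(100_000));
        // ~1 MB scanned per MapFile lookup: ~110 ms, about 9 items/drive/second.
        System.out.println(itemsPerSecond(1_000_000));
    }
}
```

Network hops and contention only make these numbers worse, which is why a benchmark of your actual random-access path is the right next step.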
Doug
