I wasn't really asking about a better approach. I was more interested in how you think about problem solving with Hadoop. I see now that it is more about sustained throughput (I should have picked that up from the GFS paper) and that algorithms need to be designed for it. This is a different type of thinking than coding an algorithm for a single machine, so I am learning as I go. Thanks for your help.

Dennis

Doug Cutting wrote:
Dennis Kubes wrote:
Ok. This is a little different in that I need to start thinking about my algorithms in terms of sequential passes and multiple jobs instead of direct access. That way I can use the input directories to get the data that I need. Couldn't I also do it through a MapRunnable implementation that creates a reader shared by an inner mapper class, or is that hacking the interfaces when I should be thinking about this in terms of sequential processing?
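
Something like this rough sketch is what I had in mind. The path and value type are made up for illustration, and the exact MapRunnable/RecordReader signatures differ between Hadoop versions, so treat it as a sketch rather than working code:

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Sketch: a MapRunnable that opens a MapFile.Reader once per task and shares
// it across all records in the split.  "lookup/data" is a hypothetical side
// MapFile, assumed here to hold Text values.
public class SharedReaderMapRunner implements MapRunnable {

  private MapFile.Reader lookup;

  public void configure(JobConf job) {
    try {
      FileSystem fs = FileSystem.get(job);
      // one reader per task, reused for every map call below
      lookup = new MapFile.Reader(fs, "lookup/data", job);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void run(RecordReader input, OutputCollector output, Reporter reporter)
    throws IOException {
    try {
      WritableComparable key = (WritableComparable)input.createKey();
      Writable value = (Writable)input.createValue();
      Text found = new Text();
      while (input.next(key, value)) {
        // random access into the shared side MapFile for each input record
        if (lookup.get(key, found) != null) {
          output.collect(key, found);
        }
        reporter.progress();
      }
    } finally {
      lookup.close();
    }
  }
}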

You can do it however you like! I don't know enough about your problem to say definitively which is the best approach. We're working hard on Hadoop so that we can scalably stream data through MapReduce at megabytes/second per node. So you might do some back-of-the-envelope calculations. Figure at least 10ms per random access, so your maximum random access rate might be around 100/second per drive. Figure a 10MB/second transfer rate, so if randomly accessed items are 100kB each, then your maximum random access rate drops to 50 items/drive/second. Since these are over the network, real performance will probably be much worse. Also, MapFile requires a scan per entry, so you might really end up scanning 1MB per access, which would slow random access to 10 items/drive/second. You might benchmark your random access performance to get a better estimate, then compare that to processing the whole collection through MapReduce.
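
Spelling that arithmetic out (the numbers are the assumed figures above, not measurements):

// Back-of-the-envelope estimate: ~10ms per seek, ~10MB/s transfer, and either
// 100kB per item or ~1MB scanned per MapFile lookup.
public class RandomAccessEstimate {

  // items/second/drive when each access costs one seek plus a transfer
  static double itemsPerSecond(double seekMs, double bytesPerItem, double bytesPerMs) {
    double msPerItem = seekMs + bytesPerItem / bytesPerMs;
    return 1000.0 / msPerItem;
  }

  public static void main(String[] args) {
    double seekMs = 10.0;              // one random access
    double bytesPerMs = 10e6 / 1000;   // 10MB/s transfer rate

    System.out.println(itemsPerSecond(seekMs, 0, bytesPerMs));     // ~100/s: seek only
    System.out.println(itemsPerSecond(seekMs, 100e3, bytesPerMs)); // ~50/s: 100kB per item
    System.out.println(itemsPerSecond(seekMs, 1e6, bytesPerMs));   // ~9/s: ~1MB MapFile scan
  }
}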

Doug
