I wasn't really asking about a better approach. I was more interested
in how you are thinking about Hadoop in terms of problem solving. I
see now that it is more about sustained throughput (I should have
picked that up from the GFS paper) and that algorithms need to be
coded for sustained throughput. This is a different type of thinking
than coding an algorithm for a single machine, so I am learning as I
go. Thanks for your help.
Dennis
Doug Cutting wrote:
Dennis Kubes wrote:
Ok. This is a little different in that I need to start thinking
about my algorithms in terms of sequential passes and multiple jobs
instead of direct access. That way I can use the input directories
to get the data that I need. Couldn't I also do it through a
MapRunnable implementation that creates a reader shared by an inner
mapper class, or is that hacking the interfaces when I should be
thinking about this in terms of sequential processing?
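Something like this sketch is what I had in mind (very rough: the
class name, the "lookup.dir" property, and the Text key/value types
are just placeholders, and the exact MapRunnable and MapFile.Reader
signatures may differ between Hadoop versions):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class SharedReaderRunner implements MapRunnable {

  private MapFile.Reader lookup;   // random-access reader shared with the inner mapper
  private InnerMapper mapper;

  // Inner mapper class that does one lookup per input record using the
  // shared reader held by the enclosing MapRunnable.
  private class InnerMapper {
    void map(WritableComparable key, Writable value, OutputCollector output)
      throws IOException {
      Text side = new Text();
      lookup.get(key, side);       // random access into the MapFile
      output.collect(key, side);
    }
  }

  public void configure(JobConf job) {
    try {
      FileSystem fs = FileSystem.get(job);
      // "lookup.dir" is a made-up property naming the MapFile directory
      lookup = new MapFile.Reader(fs, job.get("lookup.dir"), job);
      mapper = new InnerMapper();
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void run(RecordReader input, OutputCollector output, Reporter reporter)
    throws IOException {
    Text key = new Text();         // assumes Text keys and values in the input
    Text value = new Text();
    try {
      while (input.next(key, value)) {
        mapper.map(key, value, output);
      }
    } finally {
      lookup.close();
    }
  }
}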
You can do it however you like! I don't know enough about your
problem to say definitively which is the best approach. We're working
hard on Hadoop so that we can scalably stream data through MapReduce
at megabytes/second per node. So you might do some
back-of-the-envelope calculations. Figure at least 10ms per random
access. So
your maximum random access rate might be around 100/second per drive.
Figure a 10MB/second transfer rate, so if randomly accessed data is
100kB each, then your maximum random access rate drops to 50
items/drive/second. Since these are over the network, real performance
will probably be much worse. Also, MapFile requires a scan per entry,
so you might really end up scanning 1MB per access, which would slow
random accesses to 10 items/drive/second. You might benchmark your
random access performance to get a better estimate, then compare
that to processing the whole collection through MapReduce.
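To spell that arithmetic out (the 10ms seek and 10MB/second transfer
figures are the same rough assumptions as above, not measurements):

public class RandomAccessEstimate {
  public static void main(String[] args) {
    double seekSeconds = 0.010;       // ~10ms per random access
    double bytesPerSecond = 10e6;     // ~10MB/second sustained transfer

    // Seek-limited: 1 / 0.010s = ~100 accesses/drive/second.
    System.out.println("seek only:   " + 1 / seekSeconds);

    // 100kB per access: 0.010s + 100e3/10e6 = 0.020s -> ~50/drive/second.
    double perItem100k = seekSeconds + 100e3 / bytesPerSecond;
    System.out.println("100kB items: " + 1 / perItem100k);

    // ~1MB scanned per MapFile lookup: 0.010s + 1e6/10e6 = 0.110s,
    // i.e. roughly the 10 items/drive/second mentioned above.
    double perItem1MB = seekSeconds + 1e6 / bytesPerSecond;
    System.out.println("1MB scans:   " + 1 / perItem1MB);
  }
}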
Doug