Dennis Kubes wrote:
Ok. This is a little different in that I need to start thinking about my algorithms in terms of sequential passes and multiple jobs instead of direct access. That way I can use the input directories to get the data that I need. Couldn't I also do it through the MapRunnable interface, with a reader shared by an inner mapper class, or is that hacking the interfaces when I should be thinking about this in terms of sequential processing?
You can do it however you like! I don't know enough about your problem to say definitively which is the best approach. We're working hard on Hadoop so that we can scalably stream data through MapReduce at megabytes/second per node. So you might do some back-of-the-envelope calculations. Figure at least 10ms per random access. So your maximum random access rate might be around 100/second per drive. Figure a 10MB/second transfer rate, so if randomly accessed items are 100kB each, each access costs roughly 10ms of seek plus 10ms of transfer, and your maximum random access rate drops to 50 items/drive/second. Since these are over the network, real performance will probably be much worse. Also, MapFile requires a scan per entry, so you might really end up scanning 1MB per access, which would slow random accesses to roughly 10 items/drive/second. You might benchmark your random access performance to get a better estimate, then compare that to processing the whole collection through MapReduce.
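For what it's worth, the arithmetic above can be written down as a tiny cost model. This is just a sketch of the estimates in the paragraph (10ms seek, 10MB/s transfer); the class and method names are made up for illustration, not anything in Hadoop:

```java
// Back-of-the-envelope model of random-access throughput on one drive.
// Assumptions, taken from the discussion above: ~10 ms per seek,
// ~10 MB/s sequential transfer (i.e. 10,000 bytes per millisecond).
public class AccessEstimate {
    static final double SEEK_MS = 10.0;
    static final double BYTES_PER_MS = 10_000.0; // 10 MB/s

    // Items per second one drive can serve if each access pays one
    // seek and then reads bytesPerItem sequentially.
    static double itemsPerSecond(double bytesPerItem) {
        double msPerItem = SEEK_MS + bytesPerItem / BYTES_PER_MS;
        return 1000.0 / msPerItem;
    }

    public static void main(String[] args) {
        // Seek-only: 100 items/drive/second.
        System.out.println(itemsPerSecond(0));
        // 100 kB per item: 10 ms seek + 10 ms transfer = 50 items/drive/second.
        System.out.println(itemsPerSecond(100_000));
        // ~1 MB scanned per MapFile lookup: ~110 ms, about 9 items/drive/second.
        System.out.println(itemsPerSecond(1_000_000));
    }
}
```

Network hops and contention only make these numbers worse, which is why a benchmark of your actual random-access path is the right next step.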
Doug
