I wasn't really asking about a better approach. I was more interested
in how you are thinking about Hadoop in terms of problem solving. I
see now that it is more about sustained throughput (I should have
picked that up from the GFS paper) and that algorithms need to be
coded for sustained throughput. This is a different type of thinking
than coding an algorithm for a single machine, so I am learning as I
go. Thanks for your help.
Dennis
Doug Cutting wrote:
Dennis Kubes wrote:
Ok. This is a little different in that I need to start thinking
about my algorithms in terms of sequential passes and multiple jobs
instead of direct access. That way I can use the input directories
to get the data that I need. Couldn't I also do it through a
MapRunnable implementation that creates a reader shared by an inner
mapper class, or is that hacking the interfaces when I should be
thinking about this in terms of sequential processing?
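Something like this sketch is what I had in mind (very rough: the
class name, the "lookup.dir" property, and the Text key/value types
are just placeholders, and the exact MapRunnable and MapFile.Reader
signatures may differ between Hadoop versions):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class SharedReaderRunner implements MapRunnable {

  private MapFile.Reader lookup;   // random-access reader shared with the inner mapper
  private InnerMapper mapper;

  // Inner mapper class that does one lookup per input record using the
  // shared reader held by the enclosing MapRunnable.
  private class InnerMapper {
    void map(WritableComparable key, Writable value, OutputCollector output)
      throws IOException {
      Text side = new Text();
      lookup.get(key, side);       // random access into the MapFile
      output.collect(key, side);
    }
  }

  public void configure(JobConf job) {
    try {
      FileSystem fs = FileSystem.get(job);
      // "lookup.dir" is a made-up property naming the MapFile directory
      lookup = new MapFile.Reader(fs, job.get("lookup.dir"), job);
      mapper = new InnerMapper();
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void run(RecordReader input, OutputCollector output, Reporter reporter)
    throws IOException {
    Text key = new Text();         // assumes Text keys and values in the input
    Text value = new Text();
    try {
      while (input.next(key, value)) {
        mapper.map(key, value, output);
      }
    } finally {
      lookup.close();
    }
  }
}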
You can do it however you like! I don't know enough about your
problem to say definitively which is the best approach. We're working
hard on Hadoop so that we can scalably stream data through MapReduce
at megabytes/second per node. So you might do some
back-of-the-envelope calculations. Figure at least 10ms per random
access. So
your maximum random access rate might be around 100/second per drive.
Figure a 10MB/second transfer rate, so if randomly accessed data is
100kB each, then your maximum random access rate drops to 50
items/drive/second. Since these are over the network, real performance
will probably be much worse. Also, MapFile requires a scan per entry,
so you might really end up scanning 1MB per access, which would slow
random accesses to 10 items/drive/second. You might benchmark your
random access performance to get a better estimate, then compare
that to processing the whole collection through MapReduce.
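To spell that arithmetic out (the 10ms seek and 10MB/second transfer
figures are the same rough assumptions as above, not measurements):

public class RandomAccessEstimate {
  public static void main(String[] args) {
    double seekSeconds = 0.010;       // ~10ms per random access
    double bytesPerSecond = 10e6;     // ~10MB/second sustained transfer

    // Seek-limited: 1 / 0.010s = ~100 accesses/drive/second.
    System.out.println("seek only:   " + 1 / seekSeconds);

    // 100kB per access: 0.010s + 100e3/10e6 = 0.020s -> ~50/drive/second.
    double perItem100k = seekSeconds + 100e3 / bytesPerSecond;
    System.out.println("100kB items: " + 1 / perItem100k);

    // ~1MB scanned per MapFile lookup: 0.010s + 1e6/10e6 = 0.110s,
    // i.e. roughly the 10 items/drive/second mentioned above.
    double perItem1MB = seekSeconds + 1e6 / bytesPerSecond;
    System.out.println("1MB scans:   " + 1 / perItem1MB);
  }
}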
Doug