Ok. This is a little different in that I need to start thinking about
my algorithms in terms of sequential passes and multiple jobs instead of
direct access. That way I can use the input directories to get the data
that I need. Couldn't I also do it through the MapRunnable interface,
creating a reader that is shared by an inner mapper class, or is that
hacking the interfaces when I should be thinking about this in terms of
sequential processing?
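
Something like the sketch below is what I have in mind. It's untested,
written against the org.apache.hadoop.mapred API, and the linkdb path is
just a placeholder; the lookup could equally be delegated to an inner
mapper class from run().

  import java.io.IOException;

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.io.MapFile;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapRunnable;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.RecordReader;
  import org.apache.hadoop.mapred.Reporter;

  public class SharedReaderRunner implements MapRunnable<Text, Text, Text, Text> {

    private MapFile.Reader linkReader;

    public void configure(JobConf job) {
      try {
        FileSystem fs = FileSystem.get(job);
        // placeholder location for the inverted link db
        linkReader = new MapFile.Reader(fs, "linkdb/part-00000", job);
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    }

    public void run(RecordReader<Text, Text> input,
                    OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
      Text url = input.createKey();
      Text value = input.createValue();
      Text inlinks = new Text();
      try {
        while (input.next(url, value)) {
          // every input record shares the one reader opened in configure()
          if (linkReader.get(url, inlinks) != null) {
            output.collect(url, inlinks);
          }
        }
      } finally {
        linkReader.close();
      }
    }
  }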
Dennis
Doug Cutting wrote:
Dennis Kubes wrote:
The problem is that I have a single url. I get the inlinks to that
url, and then I need to access the content of all of its inlink urls
that have been fetched.

I was doing this through random access. But then I went back and
re-read the Google MapReduce paper and saw that it was designed for
sequential access, and that Hadoop implements it the same way. But
so far I haven't found a way to solve this kind of problem efficiently
in a sequential fashion.
If your input urls are only a small fraction of the collection, then
random access might be appropriate, or you might instead use two (or
more) MapReduce passes, something like:
1. url -> inlink urls (using previously inverted link db)
2. inlink urls -> inlink content
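
In driver form that might look something like this (a rough sketch
against the old mapred API; the paths are placeholders and the lookup
mappers are left out):

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.SequenceFileInputFormat;
  import org.apache.hadoop.mapred.SequenceFileOutputFormat;

  public class TwoPassDriver {
    public static void main(String[] args) throws Exception {
      // Pass 1: url -> inlink urls, against the previously inverted link db.
      JobConf pass1 = new JobConf(TwoPassDriver.class);
      pass1.setJobName("url -> inlink urls");
      pass1.setInputFormat(SequenceFileInputFormat.class);
      pass1.setOutputFormat(SequenceFileOutputFormat.class);
      pass1.setOutputKeyClass(Text.class);
      pass1.setOutputValueClass(Text.class);
      // pass1.setMapperClass(InlinkLookupMapper.class);  // your lookup mapper
      FileInputFormat.addInputPath(pass1, new Path("urls"));          // placeholder
      FileOutputFormat.setOutputPath(pass1, new Path("inlink-urls")); // placeholder
      JobClient.runJob(pass1);

      // Pass 2: inlink urls -> inlink content, against the content table.
      JobConf pass2 = new JobConf(TwoPassDriver.class);
      pass2.setJobName("inlink urls -> inlink content");
      pass2.setInputFormat(SequenceFileInputFormat.class);
      pass2.setOutputFormat(SequenceFileOutputFormat.class);
      pass2.setOutputKeyClass(Text.class);
      pass2.setOutputValueClass(Text.class);
      // pass2.setMapperClass(ContentLookupMapper.class);  // your lookup mapper
      FileInputFormat.addInputPath(pass2, new Path("inlink-urls"));
      FileOutputFormat.setOutputPath(pass2, new Path("inlink-content"));
      JobClient.runJob(pass2);
    }
  }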
In each case the mapping might look like it's doing random access,
but if the input keys are sorted, and the "table" you're "selecting"
from (the link db in the first case and the content in the second) is
sorted, then the accesses will actually be sequential, scanning each
table only once. But these will generally be remote DFS accesses.
MapReduce can usually arrange to place tasks on a node where the input
data is local, but when the map task then accesses other files this
optimization cannot be made.
In Nutch, things are slightly more complicated, since the content is
organized by segment, each sorted by URL. So you could either add
another MapReduce pass so that the inlink urls are sorted by segment
then url, or you could append all of your segments into a single segment.
But if you're performing the calculation over the entire collection,
or even a substantial fraction, then you might be able to use a single
MapReduce pass, with the content and link db as inputs, performing
your required computations in reduce. For anything larger than a
small fraction of your collection this will likely be fastest.
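
In that single pass, the reduce side might look roughly like the
sketch below. It assumes the map phase tags each value with its source
("L:" for link db records, "C:" for content), which is not shown, and
it only demonstrates the join; what you emit depends on your computation:

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.Iterator;
  import java.util.List;

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  public class JoinReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text url, Iterator<Text> values,
                       OutputCollector<Text, Text> output,
                       Reporter reporter) throws IOException {
      List<String> inlinks = new ArrayList<String>();
      String content = null;
      // all link db and content records for this url arrive together
      while (values.hasNext()) {
        String v = values.next().toString();
        if (v.startsWith("L:")) {
          inlinks.add(v.substring(2));
        } else if (v.startsWith("C:")) {
          content = v.substring(2);
        }
      }
      if (content == null) return;  // url was never fetched
      // the url's content and its inlink list are now available side
      // by side; perform the per-url computation here and emit, e.g.,
      // the joined record:
      output.collect(url, new Text(content + "\t" + inlinks.toString()));
    }
  }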
If I were to do it in configure() and close(), wouldn't that still
open a single reader per map call?
configure() and close() are only called once per map task.
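
So a mapper can open its reader once in configure() and close it once
in close(), something like this (a sketch; the path is a placeholder):

  import java.io.IOException;

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.io.MapFile;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class LookupMapper extends MapReduceBase
      implements Mapper<Text, Text, Text, Text> {

    private MapFile.Reader reader;
    private Text inlinks = new Text();

    public void configure(JobConf job) {
      try {
        FileSystem fs = FileSystem.get(job);
        // opened once per map task, not once per map() call
        reader = new MapFile.Reader(fs, "linkdb/part-00000", job);  // placeholder
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    }

    public void map(Text url, Text value,
                    OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
      // reuses the task-wide reader; with sorted input keys these
      // lookups move through the table sequentially
      if (reader.get(url, inlinks) != null) {
        output.collect(url, inlinks);
      }
    }

    public void close() throws IOException {
      reader.close();  // once, at the end of the task
    }
  }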
Doug