Hi Tejas,

On Sat, Jan 4, 2014 at 8:01 AM, <[email protected]> wrote:
> I realized that by using MultipleInputs, we can read CrawlDatum objects
> from the crawldb and URLs from the seeds file simultaneously and perform
> the inject in a single map-reduce job. PFA Injector2.java, which is an
> implementation of this approach. I did some basic testing on it and so
> far I have not encountered any problems.

Dynamite, Tejas. I would kindly ask that you open an issue and apply your
patch against trunk :)

> I am not sure why Injector was not written this way, which is more
> efficient than the one currently in trunk (maybe MultipleInputs was added
> to Hadoop later).

As far as I have discovered, joins have been available in Hadoop's mapred
package and subsequently in the mapreduce package, so it may not be a case
of them not being available... however, this goes no way toward explaining
why the Injector was not written like this.

> Wondering if I am wrong somewhere in my understanding. Any comments about
> this?

I am curious to discover how much more efficient the MultipleInputs class
is over the sequential MR jobs as currently implemented. Do you have any
comparison on the size of the dataset being used? There is a script [0] I
keep on my GitHub which we can test this against (1M URLs). This would
provide a reasonable input dataset on which to base some efficiency tests.

Great observations Tejas.

Lewis

[0] https://github.com/lewismc/nipt
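P.S. For anyone following along, a minimal sketch of the single-job inject
Tejas describes might look something like the code below. It assumes
Hadoop's newer mapreduce API (MultipleInputs attaching one mapper per
input) and Nutch's CrawlDatum; the names SingleJobInject, CrawlDbMapper,
SeedMapper and InjectReducer are illustrative, not taken from the attached
Injector2.java, and the merge logic is deliberately simplified (no
metadata, scoring or URL normalization, which a real Injector must handle).

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableUtils;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.nutch.crawl.CrawlDatum;

public class SingleJobInject {

  /** Passes existing crawldb entries through unchanged. */
  public static class CrawlDbMapper
      extends Mapper<Text, CrawlDatum, Text, CrawlDatum> {
    @Override
    protected void map(Text url, CrawlDatum datum, Context context)
        throws IOException, InterruptedException {
      context.write(url, datum);
    }
  }

  /** Turns each non-empty seed line into a freshly injected CrawlDatum. */
  public static class SeedMapper
      extends Mapper<LongWritable, Text, Text, CrawlDatum> {
    @Override
    protected void map(LongWritable key, Text line, Context context)
        throws IOException, InterruptedException {
      String url = line.toString().trim();
      if (url.isEmpty() || url.startsWith("#")) return; // skip blanks/comments
      CrawlDatum datum = new CrawlDatum();
      datum.setStatus(CrawlDatum.STATUS_INJECTED);
      context.write(new Text(url), datum);
    }
  }

  /** Keeps the existing crawldb entry when a URL appears in both inputs. */
  public static class InjectReducer
      extends Reducer<Text, CrawlDatum, Text, CrawlDatum> {
    @Override
    protected void reduce(Text url, Iterable<CrawlDatum> values,
        Context context) throws IOException, InterruptedException {
      CrawlDatum existing = null;
      CrawlDatum injected = null;
      for (CrawlDatum d : values) {
        // Hadoop reuses the value object across iterations, so clone keepers.
        if (d.getStatus() == CrawlDatum.STATUS_INJECTED) {
          injected = WritableUtils.clone(d, context.getConfiguration());
        } else {
          existing = WritableUtils.clone(d, context.getConfiguration());
        }
      }
      if (existing != null) {
        context.write(url, existing);
      } else {
        injected.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
        context.write(url, injected);
      }
    }
  }

  /** Wires both inputs into one job via MultipleInputs. */
  public static Job createJob(Configuration conf, Path crawlDb, Path seeds,
      Path out) throws IOException {
    Job job = Job.getInstance(conf, "inject (single job)");
    job.setJarByClass(SingleJobInject.class);
    MultipleInputs.addInputPath(job, crawlDb, SequenceFileInputFormat.class,
        CrawlDbMapper.class);
    MultipleInputs.addInputPath(job, seeds, TextInputFormat.class,
        SeedMapper.class);
    job.setReducerClass(InjectReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(CrawlDatum.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileOutputFormat.setOutputPath(job, out);
    return job;
  }
}

The point being that both inputs feed a single shuffle, so existing
entries and new seeds meet in one reduce, rather than the sort job plus
merge job the current Injector runs back to back.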

