Hi - Yes, MultipleInputs works very well; I did that too when coding the 
HostDB. The MultipleInputs class was not available when the injector was 
originally written; it was introduced around Hadoop 0.19 or 0.20. I see no 
reason not to replace this, so +1 for a new ticket. If unit tests pass, we're 
good to go.

-----Original message-----
From: Lewis John Mcgibbney<[email protected]>
Sent: Monday 6th January 2014 15:40
To: [email protected]
Subject: Re: Inject operation: can't it be done in a single map-reduce job ?

Hi Tejas,

On Sat, Jan 4, 2014 at 8:01 AM, <[email protected]> wrote:

I realized that by using MultipleInputs, we can read CrawlDatum objects from 
crawldb and urls from seeds file simultaneously and perform inject in a single 
map-reduce job. PFA Injector2.java which is an implementation of this approach. 
I did some basic testing on it and so far I have not encountered any problems.
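
For readers following the thread, here is a minimal plain-Java sketch of the 
single-pass idea described above. It is not the attached Injector2.java and 
uses no Hadoop classes: the class name, the Tagged record, and the status 
strings are hypothetical, and the rule that an existing crawldb entry wins 
over a newly injected URL is an assumption made for illustration. In a real 
job, the two "map" methods would be separate Mapper classes, each bound to its 
own input path with MultipleInputs.addInputPath, and the merge would run in 
the reducer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a one-pass inject; no Hadoop classes are used.
class SingleJobInjectSketch {

    // Hypothetical tagged value: a status plus a flag recording its source.
    static final class Tagged {
        final String status;
        final boolean fromSeeds;

        Tagged(String status, boolean fromSeeds) {
            this.status = status;
            this.fromSeeds = fromSeeds;
        }
    }

    // "Mapper" for the crawldb input: emit existing entries keyed by URL.
    static void mapCrawlDb(Map<String, String> crawlDb,
                           Map<String, List<Tagged>> shuffle) {
        for (Map.Entry<String, String> e : crawlDb.entrySet()) {
            shuffle.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                   .add(new Tagged(e.getValue(), false));
        }
    }

    // "Mapper" for the seed file: emit each non-empty line keyed by URL.
    static void mapSeeds(List<String> seedLines,
                         Map<String, List<Tagged>> shuffle) {
        for (String line : seedLines) {
            String url = line.trim();
            if (url.isEmpty()) {
                continue; // skip blank lines
            }
            shuffle.computeIfAbsent(url, k -> new ArrayList<>())
                   .add(new Tagged("injected", true));
        }
    }

    // "Reducer": all values for one URL arrive together.
    // Assumption: an existing crawldb entry takes precedence over a seed.
    static String reduce(List<Tagged> values) {
        String result = null;
        for (Tagged t : values) {
            if (!t.fromSeeds) {
                return t.status; // keep the existing entry unchanged
            }
            result = "db_unfetched"; // new URL enters the db as unfetched
        }
        return result;
    }

    // Runs both "mappers" into one shuffle, then reduces once per URL.
    static Map<String, String> inject(Map<String, String> crawlDb,
                                      List<String> seedLines) {
        Map<String, List<Tagged>> shuffle = new TreeMap<>();
        mapCrawlDb(crawlDb, shuffle);
        mapSeeds(seedLines, shuffle);

        Map<String, String> merged = new TreeMap<>();
        for (Map.Entry<String, List<Tagged>> e : shuffle.entrySet()) {
            merged.put(e.getKey(), reduce(e.getValue()));
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> crawlDb = new TreeMap<>();
        crawlDb.put("http://a.example/", "db_fetched");

        List<String> seeds = new ArrayList<>();
        seeds.add("http://a.example/");
        seeds.add("http://b.example/");

        System.out.println(inject(crawlDb, seeds));
    }
}
```

The point of the sketch is that both sources feed the same shuffle, so the 
merge needs only one reduce pass instead of a second job over the output of 
the first.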

Dynamite Tejas. I would kindly ask that you open an issue and apply your patch 
against trunk :)

I am not sure why Injector was not written this way, which is more efficient 
than the version currently in trunk (maybe MultipleInputs was added to Hadoop 
later).

As far as I have discovered, joins have been available in Hadoop's mapred 
package and subsequently in the mapreduce package, so it may not be a case of 
them not being available... however, this does nothing to explain why the 
Injector was not written in this way.

Wondering if I am wrong somewhere in my understanding. Any comments about this?

I am curious to discover how much more efficient using the MultipleInputs class 
is than the sequential MR jobs currently implemented. Do you have any 
comparison based on the size of the dataset being used?

There is a script [0] I keep on my github which we can test this against (1M 
URLs). This would provide a reasonable input dataset which we could use to base 
some efficiency tests on.

Great observations Tejas.

Lewis

[0] https://github.com/lewismc/nipt

