Brute force: let the input be splittable. In each map task, open the original 
file and, for each line in the split, iterate over all the lines that precede 
it in the input file. That will at least get you the parallelism.
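
A rough sketch of that mapper, written against the generic mapred interfaces 
(adjust for your Hadoop version); the "cross.input.path" property and the 
comparison step are made up here, you'd plug in your own:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class BruteForceMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private FileSystem fs;
  private Path input;  // the original, whole input file

  public void configure(JobConf job) {
    try {
      fs = FileSystem.get(job);
      // "cross.input.path" is a made-up property the driver would set
      input = new Path(job.get("cross.input.path"));
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // re-open the file and walk only the lines that precede this one,
    // using the key (byte offset of this line) as the stopping point;
    // the byte counting assumes plain '\n' line endings
    BufferedReader in =
        new BufferedReader(new InputStreamReader(fs.open(input), "UTF-8"));
    try {
      long pos = 0;
      String other;
      while (pos < offset.get() && (other = in.readLine()) != null) {
        pos += other.getBytes("UTF-8").length + 1;
        // the real comparison goes here; for now just emit the pair
        output.collect(line, new Text(other));
      }
    } finally {
      in.close();
    }
  }
}

Note that every map task rescans the file, so it's still O(n^2) work overall; 
you just spread it across tasks.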

But a better approach would be to try to cast your problem as a 
sorting/grouping problem. Do all lines really have to be compared against each 
other, or is it possible to bucketize lines that might match? (Then use 
map/reduce to group the lines, and do the matching within reduce.)
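
For instance (sketch only): the signature below is a stand-in for whatever key 
two matching lines are guaranteed to share, and equalsIgnoreCase stands in for 
the real match test.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// (each class would live in its own file in a real job)
public class BucketMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // crude signature: the first four characters; two lines can only
    // be compared if they land in the same bucket
    String s = line.toString();
    String sig = s.length() < 4 ? s : s.substring(0, 4);
    output.collect(new Text(sig), line);
  }
}

public class BucketReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {
  public void reduce(Text bucket, Iterator<Text> lines,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // buffer the bucket (assumes buckets stay small enough for memory)
    // and do the all-pairs comparison only within it
    List<String> buf = new ArrayList<String>();
    while (lines.hasNext()) {
      buf.add(lines.next().toString());
    }
    for (int i = 0; i < buf.size(); i++) {
      for (int j = i + 1; j < buf.size(); j++) {
        if (buf.get(i).equalsIgnoreCase(buf.get(j))) {  // stand-in test
          output.collect(new Text(buf.get(i)), new Text(buf.get(j)));
        }
      }
    }
  }
}

The better the signature, the smaller the buckets, and the less pairwise work 
the reducers have to do.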


-----Original Message-----
From: Chris Fellows [mailto:[EMAIL PROTECTED]
Sent: Fri 12/14/2007 10:41 AM
To: hadoop-user@lucene.apache.org
Subject: advanced map/reduce tutorials?
 
Hello,

The map/reduce tutorials in the Hadoop source are great for getting started. 
Are there any similar tutorials for more advanced use cases, especially 
complicated ones that might involve subclassing RecordReader, InputFormat, and 
others?

In particular, I want to write a job that does a Cartesian product of a file, 
i.e. it takes each row in the file and compares it against every other row in 
the file. My first pass involved writing a NonSplittableInputFormat and a 
RecordReader that composes two LineRecordReaders, one outerReader and one 
innerReader. This returns two rows merged into one to the map task, which does 
the comparison. 
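
Roughly, the composed reader looks like this (simplified sketch; the class 
name is made up):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;

public class PairRecordReader implements RecordReader<Text, Text> {
  private final Configuration job;
  private final FileSplit split;
  private final LineRecordReader outer;
  private LineRecordReader inner;   // restarted once per outer line
  private final LongWritable outerKey = new LongWritable();
  private final Text outerVal = new Text();
  private boolean haveOuter = false;

  public PairRecordReader(Configuration job, FileSplit split)
      throws IOException {
    this.job = job;
    this.split = split;
    this.outer = new LineRecordReader(job, split);
  }

  public boolean next(Text key, Text value) throws IOException {
    while (true) {
      if (!haveOuter) {
        if (!outer.next(outerKey, outerVal)) return false;  // all pairs done
        haveOuter = true;
        inner = new LineRecordReader(job, split);           // rescan the file
      }
      LongWritable innerKey = inner.createKey();
      Text innerVal = inner.createValue();
      if (inner.next(innerKey, innerVal)) {
        key.set(outerVal);        // key = outer row, value = inner row
        value.set(innerVal);
        return true;
      }
      inner.close();              // inner exhausted; advance the outer row
      haveOuter = false;
    }
  }

  public Text createKey() { return new Text(); }
  public Text createValue() { return new Text(); }
  public long getPos() throws IOException { return outer.getPos(); }
  public float getProgress() throws IOException { return outer.getProgress(); }
  public void close() throws IOException {
    outer.close();
    if (inner != null) inner.close();
  }
}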

It seems there must be a better way to do this. Additionally, no matter how 
many map tasks I assign, only one map task gets created and assigned by the 
JobTracker. Any ideas on a better approach? Has anyone done anything similar?

Thanks!

