Hello,

The map/reduce tutorials in the Hadoop source are great for getting started. Are there any similar tutorials for more advanced use cases, especially complicated ones that involve subclassing RecordReader, InputFormat, and so on?
In particular, I want to write a job that computes the cartesian product of a file with itself, i.e. it takes each row in the file and compares it against every other row in the file. My first pass was a NonSplittableInputFormat plus a RecordReader that composes two LineRecordReaders, an outerReader and an innerReader; for each outer row it re-scans the file with the inner reader and hands the two rows, merged into one record, to the map task, which does the comparison. It seems there must be a better way to do this.

On top of that, no matter how many map tasks I request, the job tracker only ever creates and assigns a single map task. A trimmed-down sketch of my current attempt is below. Any ideas on a better approach? Has anyone done anything similar? Thanks!
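Here is roughly what my sketch looks like (simplified, error handling omitted, written against the newer org.apache.hadoop.mapreduce API; CartesianRecordReader is just what I named the reader):

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Keeps the whole file in one split so a single reader sees every row.
    public class NonSplittableInputFormat extends TextInputFormat {
      @Override
      protected boolean isSplitable(JobContext context, Path file) {
        return false;
      }

      @Override
      public RecordReader<LongWritable, Text> createRecordReader(
          InputSplit split, TaskAttemptContext context) {
        return new CartesianRecordReader();
      }
    }

    // Emits one record per (outer row, inner row) pair by re-scanning the
    // file with the inner reader for every outer row.
    class CartesianRecordReader extends RecordReader<LongWritable, Text> {
      private final LineRecordReader outerReader = new LineRecordReader();
      private LineRecordReader innerReader = new LineRecordReader();
      private InputSplit split;
      private TaskAttemptContext context;
      private final LongWritable key = new LongWritable();
      private final Text value = new Text();
      private long pairCount = 0;
      private boolean hasOuterRow;

      @Override
      public void initialize(InputSplit split, TaskAttemptContext context)
          throws IOException, InterruptedException {
        this.split = split;
        this.context = context;
        outerReader.initialize(split, context);
        innerReader.initialize(split, context);
        hasOuterRow = outerReader.nextKeyValue();
      }

      @Override
      public boolean nextKeyValue() throws IOException, InterruptedException {
        while (hasOuterRow) {
          if (innerReader.nextKeyValue()) {
            key.set(pairCount++);
            value.set(outerReader.getCurrentValue() + "\t"
                + innerReader.getCurrentValue());
            return true;
          }
          // Inner scan exhausted: advance the outer row, restart the inner scan.
          hasOuterRow = outerReader.nextKeyValue();
          innerReader.close();
          innerReader = new LineRecordReader();
          innerReader.initialize(split, context);
        }
        return false;
      }

      @Override public LongWritable getCurrentKey() { return key; }
      @Override public Text getCurrentValue() { return value; }
      @Override public float getProgress() throws IOException {
        return outerReader.getProgress();
      }
      @Override public void close() throws IOException {
        outerReader.close();
        innerReader.close();
      }
    }

The driver just does job.setInputFormatClass(NonSplittableInputFormat.class), and each map() call then gets a value holding the two rows joined by a tab.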