Hi Dan,

You could do one of a few things to get around this.

1. In a subsequent step, merge all of your MapFile outputs into a single file. This works if your MapFile output is small.

2. Otherwise, use the same partition function Hadoop used to compute the partition ID for a key. The partition ID tells you which output file (out of the 150 files) your key is present in. E.g., if the partition ID is 23, then the output file to look in would be part-00023 in the generated output.
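To illustrate option 2: Hadoop's default HashPartitioner computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. Here is a minimal, Hadoop-free sketch of that formula (the class and method names are illustrative, not from Hadoop itself). One caveat: the result depends on the key class's hashCode(), so if your job's keys are Text, you must hash with Text's hashCode(), which differs from String's; the String key below is just for demonstration.

```java
// Sketch: locate which part-NNNNN file a key landed in, assuming the job
// used Hadoop's default HashPartitioner. PartitionLocator, partitionFor,
// and partFileFor are illustrative names, not Hadoop API.
public class PartitionLocator {

    // Same formula as HashPartitioner.getPartition: mask off the sign bit,
    // then take the remainder modulo the number of reduce tasks.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Hadoop names reducer outputs part-00000, part-00001, ... so
    // zero-pad the partition ID to five digits.
    static String partFileFor(Object key, int numReduceTasks) {
        return String.format("part-%05d", partitionFor(key, numReduceTasks));
    }

    public static void main(String[] args) {
        int reducers = 150; // the job in this thread ran with 150 reducers
        String key = "some-key";
        System.out.println(key + " -> " + partFileFor(key, reducers));
    }
}
```

Because the formula is deterministic, running it at lookup time with the same key type and the same reducer count always points at the one file that can contain the key, so only a single MapFile has to be opened.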
You can write your own Partitioner class (make sure you use the same one for your first job as well as the second) or reuse the one Hadoop already used. http://hadoop.apache.org/core/docs/r0.18.2/api/org/apache/hadoop/mapred/Partitioner.html has the details, and I think http://hadoop.apache.org/core/docs/r0.18.2/api/org/apache/hadoop/examples/SleepJob.html shows a usage example (look for SleepJob.java).

-Lohit

----- Original Message ----
From: Dan Benjamin <[EMAIL PROTECTED]>
To: [email protected]
Sent: Tuesday, November 18, 2008 10:53:47 AM
Subject: Performing a Lookup in Multiple MapFiles?

I've got a Hadoop process that creates a MapFile as its output. With one reducer this is very slow (as the map is large), but with 150 reducers (on a cluster of 80 nodes) it runs quickly. The problem is that it then produces 150 output files as well. In a subsequent process I need to perform lookups in this map - how is it recommended that I do this, given that I may not know the number of existing MapFiles or their names? Is there a cleaner solution than listing the contents of the directory containing all of the MapFiles and then just querying each in sequence?

--
View this message in context: http://www.nabble.com/Performing-a-Lookup-in-Multiple-MapFiles--tp20565940p20565940.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
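One more note, putting the two pieces together: if I remember right, the classic org.apache.hadoop.mapred API can do this routing for you. The sketch below assumes the 0.18-era MapFileOutputFormat, whose getReaders opens every part-NNNNN MapFile under the output directory and whose getEntry routes a lookup through a Partitioner, so you never have to list or name the files yourself. Treat it as a sketch against that assumed API, not tested code.

```java
// Sketch, assuming the 0.18-era org.apache.hadoop.mapred API: look up one
// key across all part-NNNNN MapFiles in a job's output directory, letting
// the same Partitioner the job used pick the right file.
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.lib.HashPartitioner;

public class MapFileLookup {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        Path outputDir = new Path(args[0]); // the first job's output directory
        Text key = new Text(args[1]);       // key to look up
        Text value = new Text();            // filled in by the lookup

        // One reader per part-NNNNN MapFile, in partition order.
        MapFile.Reader[] readers =
            MapFileOutputFormat.getReaders(FileSystem.get(conf), outputDir, conf);

        // Must be the same Partitioner class the job used (default shown here).
        HashPartitioner<Text, Text> partitioner = new HashPartitioner<Text, Text>();

        // getEntry computes the partition ID from the key, picks the matching
        // reader, and does the MapFile lookup there; null means key absent.
        Object found = MapFileOutputFormat.getEntry(readers, partitioner, key, value);
        System.out.println(found == null ? "not found" : value.toString());

        for (MapFile.Reader r : readers) {
            r.close();
        }
    }
}
```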
