Hi all. I have a job where my map transforms files, throwing out malformed records, etc. Another step in this job is to perform lookups based on certain fields in the records; think parent records from an RDBMS.
example:

  input record:  objectid --- view details --- source
  after lookup:  grandparentid --- parentid --- objectid --- view details --- source

For small lookup tables, I can obviously add a hash map and other support code to my map job to do the lookups (rough sketch at the end of this mail). When there are millions of possible lookup keys, though, what are the best practices for lookups of that size? Should they be data join jobs? Can Berkeley DBs or other self-contained databases be distributed to each of the nodes? What have your experiences been with these kinds of lookups?

-jason

(ps - big thanks to all the hadoop folks. I am having a blast using the toolkit.)
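
For reference, here is roughly what the small-lookup version of my mapper looks like. This is just a sketch: the side file name, its tab-separated layout, and the field handling are made up for illustration, and I simply load the whole table into a HashMap in setup().

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: small lookup table loaded into memory at task start, then used to
// enrich each record with its parent / grandparent ids.
public class EnrichMapper extends Mapper<LongWritable, Text, Text, Text> {

  private final Map<String, String> parentLookup = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException {
    // hypothetical side file shipped with the job, one
    // "objectid<TAB>parentid<TAB>grandparentid" line per entry
    BufferedReader in = new BufferedReader(new FileReader("parent_lookup.txt"));
    String line;
    while ((line = in.readLine()) != null) {
      String[] parts = line.split("\t");
      parentLookup.put(parts[0], parts[2] + " --- " + parts[1]);
    }
    in.close();
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(" --- ");
    if (fields.length < 3) {
      return;                                   // throw out malformed records
    }
    String objectId = fields[0];
    String parents = parentLookup.get(objectId); // "grandparentid --- parentid"
    if (parents == null) {
      return;                                   // no parent found; drop it
    }
    // emit: grandparentid --- parentid --- objectid --- view details --- source
    context.write(new Text(objectId), new Text(parents + " --- " + value.toString()));
  }
}

This is fine as long as the whole table fits in each task's memory; it obviously falls apart at millions of entries, which is the case I'm asking about.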
