Jason,

On Sat, Aug 18, 2007 at 10:53:39AM -0500, jason gessner wrote:
> For small lookups, I can obviously add hashes and other support code
> to my map job to perform the lookups. When I have millions of
> possible lookups, though, what kind of best practices are there for
> doing lookups of that size?
>
> Should they be data-join jobs? Can Berkeley DBs or other self-contained
> DBs be distributed to each of the nodes?
>
> What have your experiences been with these kinds of lookups?
There is a distributed file cache you can use:
http://lucene.apache.org/hadoop/api/org/apache/hadoop/filecache/DistributedCache.html

Beyond that, you could use an external database (MySQL/Oracle/BDB) or
web-service calls; the right choice is application-specific. I do know
of folks doing lookups via web-service calls.
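To make that concrete, here is a rough sketch of the DistributedCache
approach against the org.apache.hadoop.mapred API. The HDFS path
(/user/jason/lookup.tsv), the tab-separated key/value format, and the
LookupMapper class name are placeholders for illustration, and it assumes
the table fits in memory on each node; for a table too big for that,
you would point a local store such as BDB at the cached file instead of
loading a HashMap.

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;

  import org.apache.hadoop.filecache.DistributedCache;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class LookupMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    private Map<String, String> lookup = new HashMap<String, String>();

    // In the driver, before submitting the job, register the file:
    //   DistributedCache.addCacheFile(
    //       new java.net.URI("/user/jason/lookup.tsv"), jobConf);

    public void configure(JobConf conf) {
      try {
        // Each task gets a local copy of every cached file.
        Path[] cached = DistributedCache.getLocalCacheFiles(conf);
        BufferedReader in =
            new BufferedReader(new FileReader(cached[0].toString()));
        String line;
        while ((line = in.readLine()) != null) {
          String[] parts = line.split("\t", 2);  // key<TAB>value
          if (parts.length == 2) {
            lookup.put(parts[0], parts[1]);
          }
        }
        in.close();
      } catch (IOException e) {
        throw new RuntimeException("Failed to load lookup file", e);
      }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      // Emit only records that hit the side table.
      String hit = lookup.get(value.toString());
      if (hit != null) {
        out.collect(value, new Text(hit));
      }
    }
  }

The framework copies the cached file to each node's local disk once per
job, so every map task on that node can read it without going back to
HDFS for each lookup.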
Arun

> -jason
>
> (ps - big thanks to all the hadoop folks. I am having a blast using
> the toolkit.)