Jason,

On Sat, Aug 18, 2007 at 10:53:39AM -0500, jason gessner wrote:
>
>For small lookups, I can obviously add hashes and other support code
>to my map job to perform the lookups.  When I have millions of
>possible lookups, though, what kind of best practices are there for
>doing lookups of that size?
>
>Should they be data-join jobs?  Can Berkeley DBs or other
>self-contained DBs be distributed to each of the nodes?
>
>what have your experiences been with these kinds of lookups?
>

There is a distributed file-cache you can use: 
http://lucene.apache.org/hadoop/api/org/apache/hadoop/filecache/DistributedCache.html
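
Roughly, it looks like this with the old mapred API. This is an
untested sketch; the HDFS path and the tab-separated file format are
made-up placeholders:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LookupMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {

  private Map<String, String> lookup = new HashMap<String, String>();

  // Driver side: ship the lookup file to every node.
  // "/user/jason/lookup.txt" is a hypothetical HDFS path.
  //   DistributedCache.addCacheFile(
  //       new java.net.URI("/user/jason/lookup.txt"), jobConf);

  public void configure(JobConf conf) {
    try {
      // The framework has already copied the file to the local disk
      // of each tasktracker; just read it into memory once per task.
      Path[] cached = DistributedCache.getLocalCacheFiles(conf);
      BufferedReader in = new BufferedReader(
          new FileReader(cached[0].toString()));
      String line;
      while ((line = in.readLine()) != null) {
        String[] kv = line.split("\t", 2);  // assumes key<TAB>value lines
        if (kv.length == 2) {
          lookup.put(kv[0], kv[1]);
        }
      }
      in.close();
    } catch (IOException ioe) {
      throw new RuntimeException("Failed to load lookup file", ioe);
    }
  }

  public void map(Text key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String hit = lookup.get(key.toString());
    if (hit != null) {
      output.collect(key, new Text(hit));
    }
  }
}

Of course, with millions of entries an in-memory HashMap may not fit;
that is where shipping a read-only bdb file via the same cache, and
opening it in configure() instead, makes sense.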

Other than that, you could use an external database (mysql/oracle/bdb) 
or webservice calls; the right choice is application specific. I do know 
folks doing lookups via webservice calls.

Arun

>-jason
>
>(ps - big thanks to all the hadoop folks.  I am having a blast using
>the toolkit. )
