This sounds like a good task for the Data Join code.
If you can set things up so that all of your data is stored in MapFiles
with the same key type, the same partitioner, and the same number of
partitions, it will go very well.
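
To illustrate why the matching key type, partitioning, and partition
count matter, here is a minimal sketch in plain Python. It is not the
Hadoop Data Join API, and every name in it (partition, build_partitions,
merge_join, partitioned_join) is hypothetical. It assumes string keys
that are unique within each data set. Because both sides are partitioned
and sorted the same way, partition i of one side only ever needs to be
compared against partition i of the other, in a single linear pass.

```python
import zlib


def partition(key, num_partitions):
    # Deterministic stand-in for a partitioner: same key, same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def build_partitions(records, num_partitions):
    # Split (key, value) records into partitions, each sorted by key.
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[partition(key, num_partitions)].append((key, value))
    for part in parts:
        part.sort(key=lambda kv: kv[0])
    return parts


def merge_join(left, right):
    # Join two key-sorted lists in one pass (keys assumed unique per side).
    i = j = 0
    joined = []
    while i < len(left) and j < len(right):
        left_key, left_value = left[i]
        right_key, right_value = right[j]
        if left_key == right_key:
            joined.append((left_key, left_value, right_value))
            i += 1
            j += 1
        elif left_key < right_key:
            i += 1
        else:
            j += 1
    return joined


def partitioned_join(large, small, num_partitions):
    # Partition i of one side is only ever joined with partition i of the other.
    large_parts = build_partitions(large, num_partitions)
    small_parts = build_partitions(small, num_partitions)
    joined = []
    for left, right in zip(large_parts, small_parts):
        joined.extend(merge_join(left, right))
    return sorted(joined)
```

If the two sides used different partition counts or partitioners, a key
could land in partition 2 on one side and partition 5 on the other, and
the pairwise pass above would miss the match.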
Mori Bellamy wrote:
Hey Amer,
It sounds to me like you're going to have to write your own input
format (or at least modify an existing one). Take a look here:
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/FileSplit.html
I'm not sure how you'd go about doing this, but I hope this helps you.
(Also, have you considered preprocessing your input so that any
arbitrary mapper can know whether or not it's looking at a line from
the "large file"?)
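
One possible way to do the tagging, sketched in Python for a Hadoop
Streaming mapper. It assumes the job exposes the current split's path
to the mapper as the map_input_file environment variable (the streaming
form of the map.input.file job property). The file name "large.txt",
the "L"/"S" tags, and the function names are made up for illustration.

```python
import os
import sys

# Hypothetical name of the one very large file; adjust to the real input.
LARGE_FILE_NAME = "large.txt"


def tag_lines(lines, input_path, large_name=LARGE_FILE_NAME):
    # Prefix each line with "L" if it came from the large file, else "S".
    tag = "L" if os.path.basename(input_path) == large_name else "S"
    for line in lines:
        yield "%s\t%s" % (tag, line.rstrip("\n"))


def main():
    # Assumption: Hadoop Streaming sets map_input_file for file-based splits.
    input_path = os.environ.get("map_input_file", "")
    for tagged in tag_lines(sys.stdin, input_path):
        sys.stdout.write(tagged + "\n")
```

To use this as a mapper script, call main() at the bottom of the file.
Downstream code can then tell the two sources apart by the first
tab-separated field without knowing anything about the input layout.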
On Jul 11, 2008, at 12:31 PM, Muhammad Ali Amer wrote:
Hi,
My requirement is to compare the contents of one very large file (GB
to TB size) with a bunch of smaller files (100s of MB to GB sizes).
Is there a way I can give the mapper the 1st file independently of
the remaining bunch?
Amer
--
Jason Venner
Attributor - Program the Web <http://www.attributor.com/>