This is exactly what I wanted. Thanks Koji!

On Wed, Jan 12, 2011 at 12:57 PM, Koji Noguchi <knogu...@yahoo-inc.com> wrote:
> http://hadoop.apache.org/mapreduce/docs/current/mapred_tutorial.html#DistributedCache
>
> Packaging the files inside the job jar would work, but they would not get
> shared across multiple jobs. With the distributed cache, the copies are
> localized on the tasktracker nodes and shared among multiple jobs.
>
> Koji
>
>
> On 1/12/11 12:51 PM, "vipul sharma" <sharmavipulw...@gmail.com> wrote:
>
> I am writing a mapreduce job that converts web pages into attributes such
> as terms, ngrams, domains, regexes, etc. These attributes (terms, ngrams,
> domains, and so on) are kept in separate files that are pretty big, close
> to 500M in total. All of these files will be used by every mapper to
> convert a web page into its attributes. The process is basically: if a term
> from a file also appears in the web page, that attribute is passed to the
> reducer. In machine learning this process is called feature extraction. I
> am wondering what the best way is to access these files from the mappers.
> Should I store them on HDFS and open and read them inside every mapper, or
> should I package them inside the job jar? I appreciate your help and thanks
> for the suggestions.
>
> Vipul Sharma
>
> --
> Vipul Sharma
> sharmavipul AT gmail DOT com
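A minimal sketch of the DistributedCache approach Koji points to, written against the old (Hadoop 0.20-era) API covered by the linked tutorial. The HDFS paths (/features/terms.txt, /features/ngrams.txt), the class names, and the simple count-style output are hypothetical stand-ins for Vipul's actual attribute files and emit logic; the term files are assumed to already live on HDFS.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FeatureExtraction {

  public static class FeatureMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Set<String> terms = new HashSet<String>();

    // Load the localized attribute files once per task, not once per record.
    @Override
    protected void setup(Context context) throws IOException {
      Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
      if (cached == null) return;
      for (Path p : cached) {
        BufferedReader in = new BufferedReader(new FileReader(p.toString()));
        try {
          String line;
          while ((line = in.readLine()) != null) {
            terms.add(line.trim());
          }
        } finally {
          in.close();
        }
      }
    }

    // Emit an attribute whenever a cached term appears in the page text.
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (terms.contains(token)) {
          context.write(new Text(token), ONE);
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "feature extraction");
    job.setJarByClass(FeatureExtraction.class);
    job.setMapperClass(FeatureMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Register the HDFS-resident attribute files; the framework localizes
    // them on each tasktracker node and reuses the local copies across jobs.
    DistributedCache.addCacheFile(new URI("/features/terms.txt"), job.getConfiguration());
    DistributedCache.addCacheFile(new URI("/features/ngrams.txt"), job.getConfiguration());

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

One caveat with this sketch: holding close to 500M of terms in an in-memory HashSet assumes each task JVM has enough heap; if that is tight, a more compact in-memory structure or a map-side join would be the usual alternatives.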