I am writing a MapReduce job for converting web pages into attributes
such as terms, ngrams, domains, regexes, etc. These attributes (terms,
ngrams, domains, etc.) are kept in separate files, which are pretty
big, close to 500 MB in total. Every mapper will use all of these
files to convert a web page into its attributes: if a term from one
of the files also occurs in the web page, that attribute is passed to
the reducer. In machine learning this process is called feature
extraction.
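
Here is a minimal sketch of the mapper I have in mind (the class and
field names are made up, and the tokenization is deliberately naive):

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper: for every dictionary term that also occurs
    // in the web page, emit that attribute to the reducer.
    public class FeatureExtractionMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {

        private static final LongWritable ONE = new LongWritable(1);
        private Set<String> terms = new HashSet<String>();

        @Override
        protected void setup(Context context) {
            // The open question: populate `terms` here, either from
            // files read off HDFS or from files packaged in the job jar.
        }

        @Override
        protected void map(LongWritable key, Text page, Context context)
                throws IOException, InterruptedException {
            // Deliberately naive whitespace tokenization.
            for (String token : page.toString().toLowerCase().split("\\s+")) {
                if (terms.contains(token)) {
                    context.write(new Text(token), ONE);
                }
            }
        }
    }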

I am wondering what the best way is to access these files from the
mappers. Should I store them on HDFS and open and read them inside
every mapper, or should I package them inside the job jar?
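
For concreteness, the HDFS option would look roughly like the
following setup(), replacing the stub in the sketch above (the
/attributes/terms.txt path is only an example I made up):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Would replace the setup() stub in the mapper above.
    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        // Example path; in practice one file per attribute type.
        Path termFile = new Path("/attributes/terms.txt");
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(termFile)));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                terms.add(line.trim());
            }
        } finally {
            reader.close();
        }
    }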

I appreciate your help, and thanks in advance for any suggestions.

Vipul Sharma
