I am writing a MapReduce job that converts web pages into attributes such as terms, n-grams, domains, regexes, etc. These attributes (terms, n-grams, domains, and so on) are kept in separate files, and the files are fairly large, roughly 500 MB in total. Every mapper needs all of these files to convert a web page into its attributes: essentially, if a term from one of the files also appears in the web page, that attribute is passed to the reducer. In machine learning this process is called feature extraction.

I am wondering what the best way is to access these files from the mappers. Should I store them on HDFS and open and read them inside every mapper, or should I package them inside the job jar? I appreciate your help and thanks for the suggestions.
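For context, the HDFS option I have in mind looks roughly like the sketch below: the mapper loads one of the attribute files in setup() and then checks each page against it. The class name, the feature.terms.path config key, and the input key/value types are just placeholders I made up for illustration, not my actual code.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Placeholder mapper: loads a term list from HDFS in setup(), then emits
// (term, pageId) for every term that also appears in the page text.
public class FeatureExtractionMapper extends Mapper<Text, Text, Text, Text> {

    private final Set<String> terms = new HashSet<String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        // "feature.terms.path" is a made-up config key pointing at the terms file on HDFS.
        Path termsPath = new Path(conf.get("feature.terms.path"));
        FileSystem fs = termsPath.getFileSystem(conf);
        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(termsPath)));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                terms.add(line.trim());
            }
        } finally {
            reader.close();
        }
    }

    @Override
    protected void map(Text pageId, Text pageText, Context context)
            throws IOException, InterruptedException {
        // If a term from the file is present in the page, pass it on as a feature.
        String page = pageText.toString();
        for (String term : terms) {
            if (page.contains(term)) {
                context.write(new Text(term), pageId);
            }
        }
    }
}

My worry is that every mapper doing this read means a lot of repeated traffic for ~500 MB of files, which is why I am asking whether packaging them in the job jar (or some other mechanism) would be better.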
Vipul Sharma