Hi, I have been studying MapReduce and Hadoop for the past few weeks, and it is all quite new to me. While I have a grasp of the MapReduce process and can follow some of the example code, I still feel at a loss when it comes to creating my own exercise "project" and would appreciate any input and help with that.
The project I have in mind is to download several (hundred) HTML files from a website and use Hadoop to index the words on each page so they can be searched later. However, in all the examples I have seen so far, the data are already split and loaded into HDFS before the job runs. Here is the set of questions I have:

1. Are CopyFiles.HTTPCopyFilesMapper and/or ServerAddress what I need for this project?
2. If so, is there any detailed documentation or are there examples for these classes?
3. If not, could you please let me know conceptually how you would go about doing this?
4. If the data must be split beforehand, do I have to manually retrieve all the webpages and load them into HDFS? Or do I list the URLs of the webpages in a text file and split that file instead? (See the rough sketch in the P.S. below for what I mean.)

As you can see, I am very confused at this point and would greatly appreciate any help I can get. Thanks!

-- Jim
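P.S. To make question 4 a bit more concrete, here is a rough sketch of the kind of mapper I was imagining for the "URLs in a text file" approach. This is only my guess at how it might look with the org.apache.hadoop.mapreduce API; the class name and the idea of fetching each page inside the mapper are my own, not taken from any Hadoop example, and it would still need a driver plus a reducer that collects the list of URLs per word.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input: a text file with one URL per line (Hadoop splits this file across mappers).
// Output: (word, url) pairs, to be grouped by a reducer into an inverted index.
public class FetchAndIndexMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String pageUrl = line.toString().trim();
        if (pageUrl.isEmpty()) {
            return;
        }

        // Fetch the page over HTTP from inside the mapper.
        StringBuilder html = new StringBuilder();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(pageUrl).openStream()));
        try {
            String chunk;
            while ((chunk = in.readLine()) != null) {
                html.append(chunk).append(' ');
            }
        } finally {
            in.close();
        }

        // Emit one (word, url) pair per word on the page.
        for (String word : html.toString().split("\\W+")) {
            if (!word.isEmpty()) {
                context.write(new Text(word.toLowerCase()), new Text(pageUrl));
            }
        }
    }
}

Is that roughly the right way to think about it, or is pulling pages from the web inside a mapper considered a bad idea?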
