Look for the slide show on Nutch and Hadoop:
http://wiki.apache.org/lucene-hadoop/HadoopPresentations

Open the one called "Scalable Computing with Hadoop (Doug Cutting, May 2006)".


On 10/20/07 1:53 PM, "Jim the Standing Bear" <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I have been studying MapReduce and Hadoop for the past few weeks, and
> found it a very new concept. While I have a grasp of the MapReduce
> process and can follow some of the example code, I still feel at a
> loss when it comes to creating my own exercise "project" and would
> appreciate any input and help on that.
>
> The project I have in mind is to fetch several (hundred) HTML files
> from a website and use Hadoop to index the words of each page so they
> can be searched later. However, in all the examples I have seen so
> far, the data are split and loaded into HDFS before the job runs.
>
> Here is the set of questions I have:
>
> 1. Are CopyFiles.HTTPCopyFilesMapper and/or ServerAddress what I need
> for this project?
>
> 2. If so, is there any detailed documentation or are there examples
> for these classes?
>
> 3. If not, could you please let me know conceptually how you would go
> about doing this?
>
> 4. If the data must be split beforehand, must I manually retrieve all
> the web pages and load them into HDFS, or do I list the URLs of the
> pages in a text file and split that file instead?
>
> As you can see, I am very confused at this point and would greatly
> appreciate all the help I could get. Thanks!
>
> -- Jim
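On questions 3 and 4: you don't have to copy every page into HDFS by hand. One
common pattern is to put the list of URLs into a plain text file in HDFS (one
URL per line) and let each map task fetch its own pages, emitting (word, URL)
pairs that the reduce step turns into a small inverted index. Below is a rough,
untested sketch of that idea against the org.apache.hadoop.mapred API; the
class names UrlIndexer, FetchMapper, and IndexReducer are made up for the
example, and this is not the CopyFiles.HTTPCopyFilesMapper you asked about.
For a real crawl and index you would want Nutch, which the slides above cover.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class UrlIndexer {

  // Input: lines of a URL list file. Output: (word, url) pairs.
  public static class FetchMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String url = line.toString().trim();
      if (url.length() == 0) return;

      // Fetch the page over HTTP. A real crawler (e.g. Nutch) is politer
      // and far more robust than this.
      StringBuilder page = new StringBuilder();
      BufferedReader in = new BufferedReader(
          new InputStreamReader(new URL(url).openStream()));
      try {
        String s;
        while ((s = in.readLine()) != null) page.append(s).append(' ');
      } finally {
        in.close();
      }

      // Crude tokenization: strip tags, split on non-letters, emit each word.
      String text = page.toString().replaceAll("<[^>]*>", " ").toLowerCase();
      for (String word : text.split("[^a-z]+")) {
        if (word.length() > 0) out.collect(new Text(word), new Text(url));
      }
    }
  }

  // Collects, per word, the list of URLs that contain it.
  public static class IndexReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text word, Iterator<Text> urls,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      StringBuilder posting = new StringBuilder();
      while (urls.hasNext()) {
        if (posting.length() > 0) posting.append(", ");
        posting.append(urls.next().toString());
      }
      out.collect(word, new Text(posting.toString()));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(UrlIndexer.class);
    conf.setJobName("url-indexer");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setMapperClass(FetchMapper.class);
    conf.setReducerClass(IndexReducer.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));  // file of URLs
    FileOutputFormat.setOutputPath(conf, new Path(args[1])); // index output
    JobClient.runJob(conf);
  }
}

With the default TextInputFormat the URL list is split by lines across map
tasks, so Hadoop parallelizes the fetching for you; the index it produces is
much cruder than what Lucene/Nutch build, but it exercises the same
map/shuffle/reduce shape your project needs.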
