Thanks Ted. While the slides do give me valuable insight into the project I have in mind, I would still like to see some detailed examples/documentation for the different mappers and reducers that come with Hadoop. Do you happen to know where I can find such material? Thanks.
-- Jim

On 10/20/07, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> Look for the slide show on Nutch and Hadoop.
>
> http://wiki.apache.org/lucene-hadoop/HadoopPresentations
>
> Open the one called "Scalable Computing with Hadoop (Doug Cutting, May
> 2006)".
>
>
> On 10/20/07 1:53 PM, "Jim the Standing Bear" <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > I have been studying MapReduce and Hadoop for the past few weeks, and
> > the concept is still very new to me. While I have a grasp of the
> > MapReduce process and can follow some of the example code, I still
> > feel at a loss when it comes to creating my own exercise "project",
> > and would appreciate any input and help with that.
> >
> > The project I have in mind is to fetch several hundred HTML files
> > from a website and use Hadoop to index the words of each page so that
> > they can be searched later. However, in all the examples I have seen
> > so far, the data are split and loaded into HDFS prior to the
> > execution of the job.
> >
> > Here is the set of questions I have:
> >
> > 1. Are CopyFiles.HTTPCopyFilesMapper and/or ServerAddress what I need
> > for this project?
> >
> > 2. If so, is there any detailed documentation, or are there examples,
> > for these classes?
> >
> > 3. If not, could you please let me know conceptually how you would go
> > about doing this?
> >
> > 4. If the data must be split beforehand, must I manually retrieve all
> > the web pages and load them into HDFS, or do I list the URLs of the
> > web pages in a text file and split that file instead?
> >
> > As you can see, I am very confused at this point and would greatly
> > appreciate all the help I could get. Thanks!
> >
> > -- Jim

--
--------------------------------------
Standing Bear Has Spoken
--------------------------------------
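For reference: the stock mappers and reducers asked about above live in the org.apache.hadoop.mapred.lib package (IdentityMapper, InverseMapper, RegexMapper, TokenCountMapper, LongSumReducer, among others), though their javadoc is brief. As for question 4, one workable answer is the second option: put the URLs, one per line, in a plain text file in HDFS and have each map task fetch its pages over HTTP. Below is a minimal sketch of such a job, written against the pre-0.16 org.apache.hadoop.mapred API that was current when this thread was posted. UrlIndexer, FetchMapper, and IndexReducer are hypothetical names, the tokenization is deliberately naive, and nothing here is tested code.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical inverted-index job. Input: a text file in HDFS listing one
// URL per line (the default TextInputFormat hands each line to a map call).
// The mapper fetches the page and emits (word, url) pairs; the reducer
// collects, for each word, the set of URLs that contain it.
public class UrlIndexer {

  public static class FetchMapper extends MapReduceBase implements Mapper {
    public void map(WritableComparable key, Writable value,
                    OutputCollector output, Reporter reporter)
        throws IOException {
      String url = ((Text) value).toString().trim();
      if (url.length() == 0) return;
      // Fetch the page over HTTP from inside the map task.
      BufferedReader in = new BufferedReader(
          new InputStreamReader(new URL(url).openStream()));
      try {
        Text urlText = new Text(url);
        String line;
        while ((line = in.readLine()) != null) {
          // Naive tokenization; real HTML should have its tags stripped first.
          StringTokenizer tok = new StringTokenizer(line, " \t<>\"'=/");
          while (tok.hasMoreTokens()) {
            output.collect(new Text(tok.nextToken().toLowerCase()), urlText);
          }
        }
      } finally {
        in.close();
      }
    }
  }

  public static class IndexReducer extends MapReduceBase implements Reducer {
    public void reduce(WritableComparable word, Iterator urls,
                       OutputCollector output, Reporter reporter)
        throws IOException {
      // De-duplicate the URLs seen for this word, then emit them as a
      // comma-separated list.
      Set<String> unique = new HashSet<String>();
      while (urls.hasNext()) {
        unique.add(urls.next().toString());
      }
      StringBuffer list = new StringBuffer();
      for (String u : unique) {
        if (list.length() > 0) list.append(',');
        list.append(u);
      }
      output.collect(word, new Text(list.toString()));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(UrlIndexer.class);
    conf.setJobName("url-indexer");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setMapperClass(FetchMapper.class);
    conf.setReducerClass(IndexReducer.class);
    conf.setInputPath(new Path(args[0]));   // text file of URLs in HDFS
    conf.setOutputPath(new Path(args[1]));  // output directory for the index
    JobClient.runJob(conf);
  }
}

One caveat with this design: input splits are made by byte range, so a single small URL file may yield only one map task and no parallelism. Splitting the URL list across several files, or simply pre-fetching the pages into HDFS as the standard examples do, spreads the work.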
