Hi, I have been studying MapReduce and Hadoop for the past few weeks, and it is all quite new to me. While I have a grasp of the MapReduce process and can follow some of the example code, I still feel at a loss when it comes to creating my own exercise "project" and would appreciate any input and help with that.
The project I have in mind is to download several (hundred) HTML files from a website and use Hadoop to index the words on each page so they can be searched later. However, in all the examples I have seen so far, the data are already split and loaded into HDFS before the job runs. Here is the set of questions I have:

1. Are CopyFiles.HTTPCopyFilesMapper and/or ServerAddress what I need for this project?
2. If so, is there any detailed documentation or are there examples for these classes?
3. If not, could you please let me know conceptually how you would go about doing this?
4. If the data must be split beforehand, do I have to manually retrieve all the webpages and load them into HDFS? Or do I list the URLs of the webpages in a text file and split that file instead? (See the rough sketch in the P.S. below for what I mean.)

As you can see, I am very confused at this point and would greatly appreciate any help I can get. Thanks!

-- Jim
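P.S. To make question 4 a bit more concrete, here is a rough sketch of the kind of mapper I was imagining for the "URLs in a text file" approach. This is only my guess at how it might look with the org.apache.hadoop.mapreduce API; the class name and the idea of fetching each page inside the mapper are my own, not taken from any Hadoop example, and it would still need a driver plus a reducer that collects the list of URLs per word.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input: a text file with one URL per line (Hadoop splits this file across mappers).
// Output: (word, url) pairs, to be grouped by a reducer into an inverted index.
public class FetchAndIndexMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String pageUrl = line.toString().trim();
        if (pageUrl.isEmpty()) {
            return;
        }

        // Fetch the page over HTTP from inside the mapper.
        StringBuilder html = new StringBuilder();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(pageUrl).openStream()));
        try {
            String chunk;
            while ((chunk = in.readLine()) != null) {
                html.append(chunk).append(' ');
            }
        } finally {
            in.close();
        }

        // Emit one (word, url) pair per word on the page.
        for (String word : html.toString().split("\\W+")) {
            if (!word.isEmpty()) {
                context.write(new Text(word.toLowerCase()), new Text(pageUrl));
            }
        }
    }
}

Is that roughly the right way to think about it, or is pulling pages from the web inside a mapper considered a bad idea?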
