Grant Ingersoll wrote:
I know Hadoop is separate from Nutch, but I found the Hadoop Tutorial
(http://wiki.apache.org/nutch/NutchHadoopTutorial) on the Nutch Wiki to
be quite informative in filling in some gray areas for me on how to get
Hadoop working, so I was wondering if it is all right to link to this
one, or should some effort be made (by me) to extract the relevant
Hadoop pieces from this link and put them in a new page on the Hadoop
wiki? I know some users may be confused by the talk of Nutch in it.
That does look like a good tutorial, and I don't have a problem with
linking to it from the Hadoop wiki. Or, if you're feeling energetic,
copy it to the Hadoop wiki and remove the Nutch specifics. Then you might
make the Nutch wiki page link to your page in Hadoop's wiki.
A few notes, however:
1. mapred.map.tasks and mapred.reduce.tasks should be in
mapred-default.xml, not in hadoop-site.xml. Otherwise jobs cannot
override them, and Nutch sometimes does override them.
2. Config files now support variables, so setting just hadoop.tmp.dir
in hadoop-site.xml is usually sufficient: all other directories in the
defaults are defined relative to it.
3. When setting HADOOP_MASTER it's usually advisable to set
HADOOP_SLAVE_SLEEP=1, or else the rsyncs can fail.
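
As a rough sketch of notes 1 and 2, the two files might look something
like the following. The conf/ location, the task counts, and the
/data/hadoop/tmp path are made-up placeholders, not recommendations;
only the property names and file names come from the notes above.

```xml
<!-- conf/mapred-default.xml: defaults that individual jobs can still override -->
<configuration>
  <property>
    <name>mapred.map.tasks</name>
    <value>20</value> <!-- hypothetical value -->
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>4</value> <!-- hypothetical value -->
  </property>
</configuration>

<!-- conf/hadoop-site.xml: site-specific settings -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <!-- hypothetical path; defaults written as ${hadoop.tmp.dir}/... resolve under it -->
    <value>/data/hadoop/tmp</value>
  </property>
</configuration>
```

These are two separate files shown together for brevity. The
HADOOP_SLAVE_SLEEP setting from note 3 is a shell environment variable
rather than an XML property, so it would presumably go wherever
HADOOP_MASTER is set (typically conf/hadoop-env.sh).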
Doug