Jeremy Bensley wrote:
After going through your checklist, I realized that my understanding of
how MapReduce behaves was slightly flawed: I did not realize that the
temporary storage between map and reduce had to be in a shared
location.
The temporary storage between map and reduce is actually not stored in
NDFS, but on the nodes' local disks. The input (the url file in this
case), however, must be shared.
So, my process for running a crawl is now:
1. Set up / start the NDFS name and data nodes
2. Copy the url file into NDFS
3. Set up / start the job and task trackers
4. Run crawl with arguments referencing the NDFS locations of my
inputs and outputs
That looks right to me.
We really need a mapred & ndfs-based tutorial...
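In the meantime, here is a rough sketch of those four steps using the
bin/nutch wrapper (script names, options, and paths are from memory and
may differ in your checkout):

  # 1. start the NDFS name node and a data node
  bin/nutch namenode &
  bin/nutch datanode &

  # 2. copy the local url file into NDFS
  bin/nutch ndfs -put urls urls

  # 3. start the MapReduce daemons
  bin/nutch jobtracker &
  bin/nutch tasktracker &

  # 4. run the crawl against the NDFS paths
  bin/nutch crawl urls -dir crawled -depth 3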
The only remaining issue I have is that whenever I attempt to start a
task tracker or job tracker with the mapred configuration parameters
specified only in mapred-default.xml, I get the following error:
050816 164343 parsing
file:/main/home/jeremy/small_project/nutch-mapred/conf/nutch-default.xml
050816 164343 parsing
file:/main/home/jeremy/small_project/nutch-mapred/conf/nutch-site.xml
Exception in thread "main" java.lang.RuntimeException: Bad
mapred.job.tracker: local
at org.apache.nutch.mapred.JobTracker.getAddress(JobTracker.java:245)
at org.apache.nutch.mapred.TaskTracker.<init>(TaskTracker.java:72)
at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:609)
It is as if mapred-default.xml is not being parsed for its options. If
I specify the same options in nutch-site.xml, it works just fine.
The config files are a bit confusing. mapred-default.xml is for stuff
that may reasonably be overridden by applications, while nutch-site.xml
is for stuff that should not be overridden by applications. (Note that
your log above shows the daemon parsing only nutch-default.xml and
nutch-site.xml, which is why options placed only in mapred-default.xml
never reach the trackers.) So the name of the shared filesystem and of
the job tracker should be in nutch-site.xml, since they should not be
overridden. But, e.g., the default number of map and reduce tasks
should be in mapred-default.xml, since applications do sometimes change
these.
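Concretely, the split might look like this (property names as used in
the mapred branch; hosts, ports, and task counts are placeholders, and
each snippet goes inside the file's usual root element):

  <!-- nutch-site.xml: cluster-wide settings that jobs should not override -->
  <property>
    <name>fs.default.name</name>
    <value>namehost.example.com:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobhost.example.com:9001</value>
  </property>

  <!-- mapred-default.xml: per-job defaults that applications may override -->
  <property>
    <name>mapred.map.tasks</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>2</value>
  </property>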
The "local" job tracker should only be used in standalone
configurations, when everything runs in the same process. It doesn't
make sense to start a task tracker process configured with a "local" job
tracker. If you want to run them on the same host then you might
configure "localhost:xxxx" as the job tracker.
Doug