Jeremy Bensley wrote:
After going through your checklist, I realized that my view on how the
MapReduce function behaves was slightly flawed, as I did not realize
that the temporary storage phase between map and reduce had to be in a
shared location.

The temporary storage between map and reduce is actually not stored in NDFS, but on each node's local disk. But the input (the url file in this case) must be shared.
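Concretely, that distinction shows up as two separate config properties. The property names below are my recollection of what the mapred branch uses (check your nutch-default.xml if they differ):

```xml
<!-- Sketch only; hostnames and paths are placeholders. -->
<property>
  <name>fs.default.name</name>
  <!-- the shared NDFS namenode that holds the input url file -->
  <value>namenode.example.com:9000</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <!-- per-node local scratch space where intermediate map output lands -->
  <value>/tmp/nutch/mapred/local</value>
</property>
```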

So, my process for running crawl is now:
1. Set up / start NDFS name and data nodes
2. Copy url file into NDFS
3. Set up / start job and task trackers
4. Run crawl with arguments referencing the NDFS positions of my
inputs and outputs
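For the archives, the steps above might look roughly like the following. Command names are from memory and may not match your checkout exactly; paths are placeholders:

```sh
# 1. Start the NDFS name node, and a data node on each storage host
bin/nutch namenode &
bin/nutch datanode &

# 2. Copy the url file into NDFS
bin/nutch ndfs -put urls urls

# 3. Start the job tracker, and a task tracker on each worker host
bin/nutch jobtracker &
bin/nutch tasktracker &

# 4. Run the crawl, referencing NDFS paths for input and output
bin/nutch crawl urls -dir crawldir
```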

That looks right to me.

We really need a mapred & ndfs-based tutorial...

The only lasting issue I have is that, whenever I attempt to start a
tasktracker or jobtracker and have the configuration parameters for
mapred specified only in mapred-default.xml, I get the following
error:

050816 164343 parsing
file:/main/home/jeremy/small_project/nutch-mapred/conf/nutch-default.xml
050816 164343 parsing
file:/main/home/jeremy/small_project/nutch-mapred/conf/nutch-site.xml
Exception in thread "main" java.lang.RuntimeException: Bad
mapred.job.tracker: local
        at org.apache.nutch.mapred.JobTracker.getAddress(JobTracker.java:245)
        at org.apache.nutch.mapred.TaskTracker.<init>(TaskTracker.java:72)
        at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:609)

It is as if the mapred-default.xml is not being parsed for its
options. If I specify the same options in nutch-site.xml it works just
fine.

The config files are a bit confusing. mapred-default.xml is for stuff that may be reasonably overridden by applications, while nutch-site.xml is for stuff that should not be overridden by applications. So the name of the shared filesystem and of the job tracker should be in nutch-site.xml, since they should not be overridden. But, e.g., the default number of map and reduce tasks should be in mapred-default.xml, since applications do sometimes change these.
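So a reasonable split might look like this (a sketch; hostnames, ports, and task counts are placeholders, and the property names are as I remember them from the mapred branch):

```xml
<!-- nutch-site.xml: cluster-wide settings applications must not override -->
<property>
  <name>fs.default.name</name>
  <value>namenode.example.com:9000</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>jobtracker.example.com:9001</value>
</property>
```

```xml
<!-- mapred-default.xml: defaults applications may reasonably override -->
<property>
  <name>mapred.map.tasks</name>
  <value>10</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>3</value>
</property>
```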

The "local" job tracker should only be used in standalone configurations, when everything runs in the same process. It doesn't make sense to start a task tracker process configured with a "local" job tracker. If you want to run them on the same host then you might configure "localhost:xxxx" as the job tracker.
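In other words, a standalone task tracker needs something like this in nutch-site.xml (the port is a placeholder; it just has to match what the job tracker was started with):

```xml
<property>
  <name>mapred.job.tracker</name>
  <!-- "local" means run map and reduce in-process, with no separate
       daemons; a real task tracker needs a host:port to connect to -->
  <value>localhost:9001</value>
</property>
```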

Doug
