Jeremy Bensley wrote:
After going through your checklist, I realized that my understanding of
how MapReduce behaves was slightly flawed: I did not realize that the
temporary storage between map and reduce had to be in a shared
location.
The temporary storage between map and reduce is actually not stored in
NDFS, but on the nodes' local disks. The input (the url file in this
case), however, must be shared.
So, my process for running a crawl is now:
1. Set up / start the NDFS name and data nodes
2. Copy the url file into NDFS
3. Set up / start the job and task trackers
4. Run crawl with arguments referencing the NDFS locations of my
inputs and outputs
That looks right to me.
We really need a mapred & ndfs-based tutorial...
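In the meantime, here is a rough sketch of those four steps using the
bin/nutch wrapper (script names, options, and paths are from memory and
may differ in your checkout):

  # 1. start the NDFS name node and a data node
  bin/nutch namenode &
  bin/nutch datanode &

  # 2. copy the local url file into NDFS
  bin/nutch ndfs -put urls urls

  # 3. start the MapReduce daemons
  bin/nutch jobtracker &
  bin/nutch tasktracker &

  # 4. run the crawl against the NDFS paths
  bin/nutch crawl urls -dir crawled -depth 3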
The only remaining issue I have is that whenever I attempt to start a
task tracker or job tracker with the mapred configuration parameters
specified only in mapred-default.xml, I get the following error:
050816 164343 parsing
file:/main/home/jeremy/small_project/nutch-mapred/conf/nutch-default.xml
050816 164343 parsing
file:/main/home/jeremy/small_project/nutch-mapred/conf/nutch-site.xml
Exception in thread "main" java.lang.RuntimeException: Bad
mapred.job.tracker: local
at org.apache.nutch.mapred.JobTracker.getAddress(JobTracker.java:245)
at org.apache.nutch.mapred.TaskTracker.<init>(TaskTracker.java:72)
at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:609)
It is as if mapred-default.xml is not being parsed for its options. If
I specify the same options in nutch-site.xml, it works just fine.
The config files are a bit confusing. mapred-default.xml is for stuff
that may reasonably be overridden by applications, while nutch-site.xml
is for stuff that should not be overridden by applications. (Note that
your log above shows the daemon parsing only nutch-default.xml and
nutch-site.xml, which is why options placed only in mapred-default.xml
never reach the trackers.) So the name of the shared filesystem and of
the job tracker should be in nutch-site.xml, since they should not be
overridden. But, e.g., the default number of map and reduce tasks
should be in mapred-default.xml, since applications do sometimes change
these.
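Concretely, the split might look like this (property names as used in
the mapred branch; hosts, ports, and task counts are placeholders, and
each snippet goes inside the file's usual root element):

  <!-- nutch-site.xml: cluster-wide settings that jobs should not override -->
  <property>
    <name>fs.default.name</name>
    <value>namehost.example.com:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobhost.example.com:9001</value>
  </property>

  <!-- mapred-default.xml: per-job defaults that applications may override -->
  <property>
    <name>mapred.map.tasks</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>2</value>
  </property>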
The "local" job tracker should only be used in standalone
configurations, when everything runs in the same process. It doesn't
make sense to start a task tracker process configured with a "local" job
tracker. If you want to run them on the same host then you might
configure "localhost:xxxx" as the job tracker.
Doug