After going through your checklist, I realized that my understanding of
how MapReduce behaves was slightly flawed: I had not realized that the
temporary storage between the map and reduce phases has to be in a
shared location. So, my process for running crawl is now:

1. Set up / start the NDFS name and data nodes
2. Copy the URL file into NDFS
3. Set up / start the job and task trackers
4. Run crawl with arguments referencing the NDFS paths of my
inputs and outputs

Following these steps, I was able to get the crawl to work as
expected; a rough sketch of what I ran is below.
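
(The exact command names and NDFS shell flags may differ slightly
between checkouts of the mapred branch, and the directory names, port,
and depth are just my own choices, so treat this as a sketch rather
than an exact recipe.)

  # 1. Start the NDFS name node on the master and a data node on each slave
  bin/nutch namenode &
  bin/nutch datanode &

  # 2. Copy the local directory holding my URL list into NDFS
  #    ("urls" is my own directory name; check 'bin/nutch ndfs' usage
  #    for the exact copy flag in your checkout)
  bin/nutch ndfs -put urls urls

  # 3. Start the job tracker on the master and a task tracker on each slave
  bin/nutch jobtracker &
  bin/nutch tasktracker &

  # 4. Run the crawl against the NDFS paths
  bin/nutch crawl urls -dir crawl-out -depth 3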


The only remaining issue I have is that whenever I try to start a
tasktracker or jobtracker with the mapred configuration parameters
specified only in mapred-default.xml, I get the following error:

050816 164343 parsing
file:/main/home/jeremy/small_project/nutch-mapred/conf/nutch-default.xml
050816 164343 parsing
file:/main/home/jeremy/small_project/nutch-mapred/conf/nutch-site.xml
Exception in thread "main" java.lang.RuntimeException: Bad
mapred.job.tracker: local
        at org.apache.nutch.mapred.JobTracker.getAddress(JobTracker.java:245)
        at org.apache.nutch.mapred.TaskTracker.<init>(TaskTracker.java:72)
        at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:609)

It is as if mapred-default.xml is not being parsed for its options. If
I specify the same options in nutch-site.xml, it works just fine.
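
For what it's worth, the "Bad mapred.job.tracker: local" message looks
like the tracker is resolving mapred.job.tracker to "local" (presumably
the built-in default, which as I understand it means run jobs
in-process) instead of the host:port I put in mapred-default.xml, which
fits with that file being skipped. The entry that works once it is
moved into nutch-site.xml looks like this (the host and port are
placeholders for my job tracker machine, and the wrapper element should
match whatever the other conf files in your checkout use):

  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker-host:9001</value>
    <description>Host and port of the job tracker. A task tracker
    cannot use the value "local", hence the error above.</description>
  </property>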

I appreciate the help, and look forward to experimenting with the software.

Jeremy


On 8/16/05, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Jeremy Bensley wrote:
> > First, I have observed the same behavior as a previous poster from
> > yesterday who, instead of specifying a file for the URLs to be read
> > from, must now specify a directory (full path) in which a file
> > containing the URL list is stored. From the response to that thread I
> > am gathering that it isn't desired behavior to specify a directory
> > instead of a file.
> 
> A directory is required.  For consistency, all inputs and outputs are
> now directories of files rather than individual files.
> 
> > Second, and more importantly, I am having issues with task trackers. I
> > have three machines running task tracker, and a fourth running the job
> > tracker, and they seem to be talking well. Whenever I try to invoke
> > crawl using the job tracker, however, all of my task trackers
> > continually fail with this:
> >
> > 050816 134532 parsing /tmp/nutch/mapred/local/tracker/task_m_5o5uvx/job.xml
> > [Fatal Error] :-1:-1: Premature end of file.
> > 050816 134532 SEVERE error parsing conf file:
> > org.xml.sax.SAXParseException: Premature end of file.
> > java.lang.RuntimeException: org.xml.sax.SAXParseException: Premature
> > end of file.
> >         at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:355)
> >         at org.apache.nutch.util.NutchConf.getProps(NutchConf.java:290)
> >         at org.apache.nutch.util.NutchConf.get(NutchConf.java:91)
> >         at org.apache.nutch.mapred.JobConf.getJar(JobConf.java:80)
> >         at 
> > org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:335)
> >         at 
> > org.apache.nutch.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:319)
> >         at 
> > org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:221)
> >         at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:269)
> >         at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:610)
> > Caused by: org.xml.sax.SAXParseException: Premature end of file.
> >         at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
> >         at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
> >         at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:172)
> >         at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:315)
> >         ... 8 more
> >
> > Whenever I look at the job.xml file specified by this location, it
> > turns out that it is a directory, not a file.
> >
> > drwxrwxr-x  2 jeremy  users 4096 Aug 16 13:45 job.xml
> 
> I have not seen this before.  If you remove everything in /tmp/nutch, is
> this reproducible?  Are you using NDFS?  If not, how are you sharing
> files between task trackers?  Is this on Win32, Linux or what?  Are you
> running the latest mapred code?  If your troubles continue, please post
> your nutch-site.xml and mapred-default.xml.
> 
> Doug
>
