I have been attempting to get the mapred branch version of the crawler
working and have hit some snags.

First, I have observed the same behavior that a poster reported
yesterday: instead of specifying a file from which the URLs are read,
I must now specify a directory (full path) in which a file containing
the URL list is stored. From the response to that thread, I gather
that specifying a directory instead of a file is not the intended
behavior.

Second, and more importantly, I am having issues with task trackers. I
have three machines each running a task tracker and a fourth running
the job tracker, and they seem to be communicating fine. Whenever I try
to invoke crawl using the job tracker, however, all of my task trackers
repeatedly fail with this:

050816 134532 parsing /tmp/nutch/mapred/local/tracker/task_m_5o5uvx/job.xml
[Fatal Error] :-1:-1: Premature end of file.
050816 134532 SEVERE error parsing conf file:
org.xml.sax.SAXParseException: Premature end of file.
java.lang.RuntimeException: org.xml.sax.SAXParseException: Premature end of file.
        at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:355)
        at org.apache.nutch.util.NutchConf.getProps(NutchConf.java:290)
        at org.apache.nutch.util.NutchConf.get(NutchConf.java:91)
        at org.apache.nutch.mapred.JobConf.getJar(JobConf.java:80)
        at org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:335)
        at org.apache.nutch.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:319)
        at org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:221)
        at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:269)
        at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:610)
Caused by: org.xml.sax.SAXParseException: Premature end of file.
        at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:172)
        at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:315)
        ... 8 more
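
For reference, the standard JAXP parser raises a SAXParseException when it is handed input with no content at all, which is one way to end up with a "Premature end of file" message. The sketch below is only an illustration of that parser behavior, not the Nutch code path; the class name EmptyXmlDemo and the helper failsToParse are made up for this example.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXParseException;

// Hypothetical demo class, not part of Nutch.
public class EmptyXmlDemo {

    /** Returns true if parsing the given bytes fails with a SAXParseException. */
    static boolean failsToParse(byte[] xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            builder.parse(new ByteArrayInputStream(xml));
            return false;
        } catch (SAXParseException e) {
            // The parser found no usable document in the input.
            System.out.println("parse failed: " + e.getMessage());
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        // Empty input: the parser has nothing to read.
        failsToParse(new byte[0]);
        // A minimal well-formed document parses without error.
        failsToParse("<nutch-conf/>".getBytes("UTF-8"));
    }
}
```

If the task tracker is being handed an empty or unreadable job.xml, that would be consistent with the trace above, though I have not confirmed that this is what happens inside NutchConf.loadResource.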

When I look at the job.xml path named in the log, it turns out to be
a directory, not a file:

drwxrwxr-x  2 jeremy  users 4096 Aug 16 13:45 job.xml
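
The same check can be scripted for each task tracker machine. This is a minimal sketch using only java.io.File; the class name JobXmlCheck and the method describe are hypothetical, and the default path is simply the one copied from the log above.

```java
import java.io.File;

// Hypothetical diagnostic helper, not part of Nutch.
public class JobXmlCheck {

    /** Describes what actually exists at the given path. */
    static String describe(File f) {
        if (!f.exists()) {
            return "missing";
        }
        if (f.isDirectory()) {
            return "directory";
        }
        if (f.length() == 0) {
            return "empty file";
        }
        return "file";
    }

    public static void main(String[] args) {
        // Path copied from the log; pass a different one as the first argument.
        String path = args.length > 0
            ? args[0]
            : "/tmp/nutch/mapred/local/tracker/task_m_5o5uvx/job.xml";
        System.out.println(path + ": " + describe(new File(path)));
    }
}
```

Running it against the job.xml path from each failing task would show whether every tracker sees a directory there or only some of them.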


Any help with or observations on these issues would be most appreciated.

Thanks,

Jeremy
