Jeremy Bensley wrote:
First, I have observed the same behavior as a poster from yesterday:
instead of specifying a file for the URLs to be read from, I must now
specify a directory (full path) in which a file containing the URL
list is stored. From the response to that thread I gather that
specifying a directory instead of a file isn't the desired behavior.

A directory is required. For consistency, all inputs and outputs are now directories of files rather than individual files.
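
As a rough illustration of that convention, the sketch below writes a URL list into a file inside a directory and hands back the directory, which is what would then be given as the input. The class name, method name, directory name "urls" and file name "seeds.txt" are all made up for the example; none of them come from Nutch.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch: prepares a directory holding a URL list file.
class UrlDirSketch {

    /**
     * Creates dir if it does not exist, writes the URLs one per line into
     * dir/fileName, and returns the directory.
     */
    static Path writeUrlDir(Path dir, String fileName, List<String> urls) throws IOException {
        Files.createDirectories(dir);
        Files.write(dir.resolve(fileName), urls, StandardCharsets.UTF_8);
        return dir;
    }

    public static void main(String[] args) throws IOException {
        // A temporary location is used here only so the example is safe to run anywhere.
        Path base = Files.createTempDirectory("nutch-sketch");
        Path dir = writeUrlDir(base.resolve("urls"), "seeds.txt",
                Arrays.asList("http://example.com/"));
        // The directory, not the file inside it, is what gets specified as the input.
        System.out.println("Input directory: " + dir.toAbsolutePath());
    }
}
```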

Second, and more importantly, I am having issues with task trackers. I
have three machines running task tracker, and a fourth running the job
tracker, and they seem to be talking well. Whenever I try to invoke
crawl using the job tracker, however, all of my task trackers
continually fail with this:

050816 134532 parsing /tmp/nutch/mapred/local/tracker/task_m_5o5uvx/job.xml
[Fatal Error] :-1:-1: Premature end of file.
050816 134532 SEVERE error parsing conf file:
org.xml.sax.SAXParseException: Premature end of file.
java.lang.RuntimeException: org.xml.sax.SAXParseException: Premature end of file.
        at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:355)
        at org.apache.nutch.util.NutchConf.getProps(NutchConf.java:290)
        at org.apache.nutch.util.NutchConf.get(NutchConf.java:91)
        at org.apache.nutch.mapred.JobConf.getJar(JobConf.java:80)
        at org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:335)
        at org.apache.nutch.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:319)
        at org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:221)
        at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:269)
        at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:610)
Caused by: org.xml.sax.SAXParseException: Premature end of file.
        at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:172)
        at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:315)
        ... 8 more

Whenever I look at the job.xml file specified by this location, it
turns out that it is a directory, not a file.

drwxrwxr-x  2 jeremy  users 4096 Aug 16 13:45 job.xml

I have not seen this before. If you remove everything in /tmp/nutch, is this reproducible? Are you using NDFS? If not, how are you sharing files between task trackers? Is this on Win32, Linux or what? Are you running the latest mapred code? If your troubles continue, please post your nutch-site.xml and mapred-default.xml.
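
If it helps when checking whether this is reproducible, something like the following could scan a task tracker's local directory for job.xml entries that are directories rather than files. This is only a sketch: the class and method names are hypothetical, and the default path is simply taken from the log above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical diagnostic, not part of Nutch.
class JobXmlCheck {

    /** Returns every path under root that is named job.xml and is a directory. */
    static List<Path> findJobXmlDirectories(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(p -> p.getFileName() != null
                            && p.getFileName().toString().equals("job.xml"))
                    .filter(Files::isDirectory)
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Default location assumed from the path printed in the task tracker log.
        Path root = Paths.get(args.length > 0 ? args[0] : "/tmp/nutch/mapred/local");
        if (!Files.isDirectory(root)) {
            System.out.println("No such directory: " + root);
            return;
        }
        for (Path p : findJobXmlDirectories(root)) {
            System.out.println("job.xml is a directory: " + p);
        }
    }
}
```

Running it on each task tracker before and after clearing /tmp/nutch would show whether the directories come back.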

Doug
