Jeremy Bensley wrote:
First, I have observed the same behavior as a previous poster from
yesterday who, instead of specifying a file for the URLs to be read
from, must now specify a directory (full path) in which a file
containing the URL list is stored. From the response to that thread I
am gathering that it isn't desired behavior to specify a directory
instead of a file.
A directory is required. For consistency, all inputs and outputs are
now directories of files rather than individual files.
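
To make the directory convention concrete, here is a minimal sketch of laying out a URL list as a directory containing one file. The directory name `urls`, the file name `seeds.txt`, and the helper `write_seed_dir` are made up for illustration; none of them are names Nutch requires.

```python
import tempfile
from pathlib import Path


def write_seed_dir(parent, urls, dir_name="urls", file_name="seeds.txt"):
    """Create parent/dir_name and write one URL per line into a file inside it.

    dir_name and file_name are illustrative defaults, not Nutch conventions.
    Returns the directory path, which is what would be passed as the input.
    """
    seed_dir = Path(parent) / dir_name
    seed_dir.mkdir(parents=True, exist_ok=True)
    (seed_dir / file_name).write_text("\n".join(urls) + "\n", encoding="utf-8")
    return seed_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        seed_dir = write_seed_dir(tmp, ["http://example.com/"])
        # Pass the directory, not the file inside it, as the URL input.
        print(seed_dir)
```

The point is only that the argument names the directory; the file inside it holds the actual list.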
Second, and more importantly, I am having issues with task trackers. I
have three machines running task tracker, and a fourth running the job
tracker, and they seem to be talking well. Whenever I try to invoke
crawl using the job tracker, however, all of my task trackers
continually fail with this:
050816 134532 parsing /tmp/nutch/mapred/local/tracker/task_m_5o5uvx/job.xml
[Fatal Error] :-1:-1: Premature end of file.
050816 134532 SEVERE error parsing conf file:
org.xml.sax.SAXParseException: Premature end of file.
java.lang.RuntimeException: org.xml.sax.SAXParseException: Premature end of file.
at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:355)
at org.apache.nutch.util.NutchConf.getProps(NutchConf.java:290)
at org.apache.nutch.util.NutchConf.get(NutchConf.java:91)
at org.apache.nutch.mapred.JobConf.getJar(JobConf.java:80)
at org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:335)
at org.apache.nutch.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:319)
at org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:221)
at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:269)
at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:610)
Caused by: org.xml.sax.SAXParseException: Premature end of file.
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:172)
at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:315)
... 8 more
Whenever I look at the job.xml file specified by this location, it
turns out that it is a directory, not a file.
drwxrwxr-x 2 jeremy users 4096 Aug 16 13:45 job.xml
I have not seen this before. If you remove everything in /tmp/nutch, is
this reproducible? Are you using NDFS? If not, how are you sharing
files between task trackers? Is this on Win32, Linux or what? Are you
running the latest mapred code? If your troubles continue, please post
your nutch-site.xml and mapred-default.xml.
Doug
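
For checking whether the problem comes back after clearing /tmp/nutch, a small script along these lines could list every entry named job.xml under the task tracker's local directory and report whether it is a regular file or a directory. This is only a sketch: the helper name `classify_job_xml` is hypothetical, and the default root is simply the path that appears in the log above.

```python
import os
import sys


def classify_job_xml(root):
    """Walk root and report each entry named job.xml as 'file', 'dir', or 'other'."""
    results = []
    for dirpath, dirnames, filenames in os.walk(root):
        # job.xml may show up among directories (the failure case) or files.
        for name in dirnames + filenames:
            if name != "job.xml":
                continue
            path = os.path.join(dirpath, name)
            if os.path.isdir(path):
                kind = "dir"
            elif os.path.isfile(path):
                kind = "file"
            else:
                kind = "other"
            results.append((path, kind))
    return sorted(results)


if __name__ == "__main__":
    # Default root taken from the log output; override on the command line.
    root = sys.argv[1] if len(sys.argv) > 1 else "/tmp/nutch/mapred/local"
    for path, kind in classify_job_xml(root):
        print(kind, path)
```

Any line reported as `dir` matches the symptom described above, where the task tracker tries to parse job.xml as a configuration file and finds a directory instead.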