Just a follow-up: I figured out the third exception below (Exception in
thread "main" java.io.IOException: No input directories specified in:
NutchConf..), so no worries there. The others are still issues, though.
Matt Zytaruk wrote:
I've been having a lot of trouble lately with the newest Nutch source.
Both my crawls and parses are failing. (For our fetches we crawl and
parse at the same time with just the default Nutch config, just to get
the outlinks and update the crawldb; later on, after the fetch, we do
another parse with custom parse filters.) The exceptions are below.
This exception happens sometimes when crawling (on the linkdb part of
the crawl):
Exception in thread "main" java.io.IOException: Not a file: /user/nutch/segments/20060107130328/parse_data/part-00000/data
at org.apache.nutch.ipc.Client.call(Client.java:294)
at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
at $Proxy1.submitJob(Unknown Source)
at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:131)
We also got this for a while (it seems the mapred/system dir is never
being created, for some reason):
java.io.IOException: Cannot open filename /nutch-data/nutch/tmp/nutch/mapred/system/submit_euiwjv/job.xml
at org.apache.nutch.ipc.Client.call(Client.java:294)
at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
at $Proxy1.open(Unknown Source)
at org.apache.nutch.ndfs.NDFSClient$NDFSInputStream.openInfo(NDFSClient.java:256)
at org.apache.nutch.ndfs.NDFSClient$NDFSInputStream.<init>(NDFSClient.java:242)
at org.apache.nutch.ndfs.NDFSClient.open(NDFSClient.java:79)
at org.apache.nutch.fs.NDFSFileSystem.openRaw(NDFSFileSystem.java:66)
at org.apache.nutch.fs.NFSDataInputStream$Checker.<init>(NFSDataInputStream.java:45)
at org.apache.nutch.fs.NFSDataInputStream.<init>(NFSDataInputStream.java:221)
at org.apache.nutch.fs.NutchFileSystem.open(NutchFileSystem.java:160)
at org.apache.nutch.fs.NutchFileSystem.open(NutchFileSystem.java:149)
at org.apache.nutch.fs.NDFSFileSystem.copyToLocalFile(NDFSFileSystem.java:221)
at org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:346)
at org.apache.nutch.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:332)
at org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:232)
at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:286)
at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:651)
Then, on parsing, we got this within 10 seconds of the parse starting:
060109 093759 task_m_ltgpnj Error running child
060109 093759 task_m_ltgpnj java.lang.RuntimeException: java.io.EOFException
060109 093759 task_m_ltgpnj at org.apache.nutch.io.CompressedWritable.ensureInflated(CompressedWritable.java:57)
060109 093759 task_m_ltgpnj at org.apache.nutch.protocol.Content.getContent(Content.java:124)
060109 093759 task_m_ltgpnj at org.apache.nutch.crawl.MD5Signature.calculate(MD5Signature.java:33)
060109 093759 task_m_ltgpnj at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:62)
060109 093759 task_m_ltgpnj at org.apache.nutch.mapred.MapRunner.run(MapRunner.java:52)
060109 093759 task_m_ltgpnj at org.apache.nutch.mapred.MapTask.run(MapTask.java:116)
060109 093759 task_m_ltgpnj at org.apache.nutch.mapred.TaskTracker$Child.main(TaskTracker.java:603)
060109 093759 task_m_ltgpnj Caused by: java.io.EOFException
060109 093759 task_m_ltgpnj at java.io.DataInputStream.readFully(DataInputStream.java:268)
060109 093759 task_m_ltgpnj at org.apache.nutch.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
060109 093759 task_m_ltgpnj at org.apache.nutch.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
060109 093759 task_m_ltgpnj at org.apache.nutch.io.UTF8.readChars(UTF8.java:212)
060109 093759 task_m_ltgpnj at org.apache.nutch.io.UTF8.readString(UTF8.java:204)
060109 093759 task_m_ltgpnj at org.apache.nutch.protocol.ContentProperties.readFields(ContentProperties.java:169)
060109 093759 task_m_ltgpnj at org.apache.nutch.protocol.Content.readFieldsCompressed(Content.java:81)
060109 093759 task_m_ltgpnj at org.apache.nutch.io.CompressedWritable.ensureInflated(CompressedWritable.java:54)
060109 093759 task_m_ltgpnj ... 6 more
060109 093802 task_m_txrnu3 done; removing files.
060109 093802 Server connection on port 50050 from 127.0.0.2: exiting
060109 093805 task_m_ltgpnj done; removing files.
060109 093805 Lost connection to JobTracker [crawler-d-03.internal.wavefire.ca/127.0.0.2:8050]. ex=java.lang.NullPointerException Retrying...
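For anyone reading the parse trace above: its innermost frame is
java.io.DataInputStream.readFully, which throws EOFException when the
stream ends before the requested number of bytes has been read. One way
to hit that is a record that is shorter than its own length prefix says,
for example because it was truncated or only partially written. Whether
that is what happened to this segment is an assumption, not something
the log confirms. The sketch below uses only the plain JDK, not Nutch
code; the class name TruncatedRecordDemo, its methods, and the sample
payload are hypothetical and exist only to show the behaviour.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.Arrays;

// Hypothetical demo class; not part of Nutch.
public class TruncatedRecordDemo {

    // Writes one length-prefixed record: a 4-byte length, then the payload.
    static byte[] writeRecord(byte[] payload) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(payload.length);
        out.write(payload);
        out.flush();
        return bytes.toByteArray();
    }

    // Reads the record back. readFully throws EOFException if the stream
    // ends before the number of bytes named in the length prefix is read.
    static byte[] readRecord(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int length = in.readInt();
        byte[] payload = new byte[length];
        in.readFully(payload);
        return payload;
    }

    // Returns true if reading the data ends in an EOFException.
    static boolean hitsEof(byte[] data) throws IOException {
        try {
            readRecord(data);
            return false;
        } catch (EOFException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] complete = writeRecord("hello nutch".getBytes("UTF-8"));
        // Assumption for illustration: drop the last 3 bytes of the record.
        byte[] truncated = Arrays.copyOf(complete, complete.length - 3);
        System.out.println("complete record hits EOF:  " + hitsEof(complete));
        System.out.println("truncated record hits EOF: " + hitsEof(truncated));
    }
}
```

If something similar is going on here, comparing the size of the
segment's content data against a copy that parsed correctly before the
upgrade might be a cheap first check.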
On a different segment we got this instead:
Exception in thread "main" java.io.IOException: No input directories specified in: NutchConf: nutch-default.xml , mapred-default.xml , /nutch-data/nutch/tmp/nutch/mapred/local/jobTracker/job_tn7u97.xml , nutch-site.xml
at org.apache.nutch.ipc.Client.call(Client.java:294)
at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
at $Proxy0.submitJob(Unknown Source)
at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:95)
at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:113)
(I think you usually get this error when you don't put the right
filenames in the arguments, but that is definitely not the case here.)
These are all tasks on segments which worked fine before we updated the
source code (we had previously been working with the source from about
the beginning of December). It's also not a permissions issue, as it all
worked fine previously. The only things that have changed are the
updated code and the number of map/reduce tasks in the config. (Side
note: what is the best number of tasks to use for each? We have a set
of 2 machines that work together to crawl, and a set of 3 machines
that work together to parse/index.)
Any help would be much appreciated, as otherwise I am doomed. Thanks
ahead of time.
-Matt Zytaruk
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers