I've been having a lot of trouble lately with the newest Nutch source.
Both my crawls and parses are failing. (For our fetches we crawl and
parse at the same time with just the default Nutch config, just to get
the outlinks and update the crawldb; later on, after the fetch, we do
another parse with custom parse filters.) The exceptions are below.
This exception happens sometimes when crawling (on the linkdb part of
the crawl):
Exception in thread "main" java.io.IOException: Not a file: /user/nutch/segments/20060107130328/parse_data/part-00000/data
at org.apache.nutch.ipc.Client.call(Client.java:294)
at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
at $Proxy1.submitJob(Unknown Source)
at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:131)
We also got this for a while (it seems like the mapred/system dir is
never being created for some reason):
java.io.IOException: Cannot open filename /nutch-data/nutch/tmp/nutch/mapred/system/submit_euiwjv/job.xml
at org.apache.nutch.ipc.Client.call(Client.java:294)
at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
at $Proxy1.open(Unknown Source)
at org.apache.nutch.ndfs.NDFSClient$NDFSInputStream.openInfo(NDFSClient.java:256)
at org.apache.nutch.ndfs.NDFSClient$NDFSInputStream.<init>(NDFSClient.java:242)
at org.apache.nutch.ndfs.NDFSClient.open(NDFSClient.java:79)
at org.apache.nutch.fs.NDFSFileSystem.openRaw(NDFSFileSystem.java:66)
at org.apache.nutch.fs.NFSDataInputStream$Checker.<init>(NFSDataInputStream.java:45)
at org.apache.nutch.fs.NFSDataInputStream.<init>(NFSDataInputStream.java:221)
at org.apache.nutch.fs.NutchFileSystem.open(NutchFileSystem.java:160)
at org.apache.nutch.fs.NutchFileSystem.open(NutchFileSystem.java:149)
at org.apache.nutch.fs.NDFSFileSystem.copyToLocalFile(NDFSFileSystem.java:221)
at org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:346)
at org.apache.nutch.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:332)
at org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:232)
at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:286)
at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:651)
Then, on parsing, we got this, within 10 seconds of the parse starting:
060109 093759 task_m_ltgpnj Error running child
060109 093759 task_m_ltgpnj java.lang.RuntimeException: java.io.EOFException
060109 093759 task_m_ltgpnj at org.apache.nutch.io.CompressedWritable.ensureInflated(CompressedWritable.java:57)
060109 093759 task_m_ltgpnj at org.apache.nutch.protocol.Content.getContent(Content.java:124)
060109 093759 task_m_ltgpnj at org.apache.nutch.crawl.MD5Signature.calculate(MD5Signature.java:33)
060109 093759 task_m_ltgpnj at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:62)
060109 093759 task_m_ltgpnj at org.apache.nutch.mapred.MapRunner.run(MapRunner.java:52)
060109 093759 task_m_ltgpnj at org.apache.nutch.mapred.MapTask.run(MapTask.java:116)
060109 093759 task_m_ltgpnj at org.apache.nutch.mapred.TaskTracker$Child.main(TaskTracker.java:603)
060109 093759 task_m_ltgpnj Caused by: java.io.EOFException
060109 093759 task_m_ltgpnj at java.io.DataInputStream.readFully(DataInputStream.java:268)
060109 093759 task_m_ltgpnj at org.apache.nutch.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
060109 093759 task_m_ltgpnj at org.apache.nutch.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
060109 093759 task_m_ltgpnj at org.apache.nutch.io.UTF8.readChars(UTF8.java:212)
060109 093759 task_m_ltgpnj at org.apache.nutch.io.UTF8.readString(UTF8.java:204)
060109 093759 task_m_ltgpnj at org.apache.nutch.protocol.ContentProperties.readFields(ContentProperties.java:169)
060109 093759 task_m_ltgpnj at org.apache.nutch.protocol.Content.readFieldsCompressed(Content.java:81)
060109 093759 task_m_ltgpnj at org.apache.nutch.io.CompressedWritable.ensureInflated(CompressedWritable.java:54)
060109 093759 task_m_ltgpnj ... 6 more
060109 093802 task_m_txrnu3 done; removing files.
060109 093802 Server connection on port 50050 from 127.0.0.2: exiting
060109 093805 task_m_ltgpnj done; removing files.
060109 093805 Lost connection to JobTracker [crawler-d-03.internal.wavefire.ca/127.0.0.2:8050]. ex=java.lang.NullPointerException Retrying...
On a different segment we got this instead:
Exception in thread "main" java.io.IOException: No input directories specified in: NutchConf: nutch-default.xml , mapred-default.xml , /nutch-data/nutch/tmp/nutch/mapred/local/jobTracker/job_tn7u97.xml , nutch-site.xml
at org.apache.nutch.ipc.Client.call(Client.java:294)
at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
at $Proxy0.submitJob(Unknown Source)
at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:95)
at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:113)
(I think you usually get this error when you don't put the right
filenames in the arguments, but that is definitely not the case here.)
These are all tasks on segments which worked fine before we updated the
source code (previously we had been working with the source from about
the beginning of December). It's also not a permissions issue, as it all
worked fine previously. The only things that have changed are the updated
code and the number of map/reduce tasks in the config. (Side note: what
is the best number of map and reduce tasks to use? We have a set of 2
machines that work together to crawl, and a set of 3 machines that work
together to parse/index.)
Any help would be much appreciated, as otherwise I am doomed. Thanks
ahead of time.
-Matt Zytaruk