Just a follow-up: I figured out the third exception below (Exception in thread "main" java.io.IOException: No input directories specified in: NutchConf..), so no worries there. The others are still issues, though.

Matt Zytaruk wrote:

I've been having a lot of trouble lately with the newest Nutch source. Both my crawls and parses are failing. (For our fetches we crawl and parse at the same time with just the default Nutch config, just to get the outlinks and update the crawldb; later on, after the fetch, we do another parse with custom parse filters.) The exceptions are below.

This exception happens sometimes when crawling (on the linkdb part of the crawl):

Exception in thread "main" java.io.IOException: Not a file: /user/nutch/segments/20060107130328/parse_data/part-00000/data
       at org.apache.nutch.ipc.Client.call(Client.java:294)
       at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
       at $Proxy1.submitJob(Unknown Source)
       at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
       at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
       at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:131)

We also got this for a while (it seems like the mapred/system dir is never being created for some reason):

java.io.IOException: Cannot open filename /nutch-data/nutch/tmp/nutch/mapred/system/submit_euiwjv/job.xml
      at org.apache.nutch.ipc.Client.call(Client.java:294)
      at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
      at $Proxy1.open(Unknown Source)
      at org.apache.nutch.ndfs.NDFSClient$NDFSInputStream.openInfo(NDFSClient.java:256)
      at org.apache.nutch.ndfs.NDFSClient$NDFSInputStream.<init>(NDFSClient.java:242)
      at org.apache.nutch.ndfs.NDFSClient.open(NDFSClient.java:79)
      at org.apache.nutch.fs.NDFSFileSystem.openRaw(NDFSFileSystem.java:66)
      at org.apache.nutch.fs.NFSDataInputStream$Checker.<init>(NFSDataInputStream.java:45)
      at org.apache.nutch.fs.NFSDataInputStream.<init>(NFSDataInputStream.java:221)
      at org.apache.nutch.fs.NutchFileSystem.open(NutchFileSystem.java:160)
      at org.apache.nutch.fs.NutchFileSystem.open(NutchFileSystem.java:149)
      at org.apache.nutch.fs.NDFSFileSystem.copyToLocalFile(NDFSFileSystem.java:221)
      at org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:346)
      at org.apache.nutch.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:332)
      at org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:232)
      at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:286)
      at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:651)

Then, on parsing, we got this, within 10 seconds of the parse starting:

060109 093759 task_m_ltgpnj  Error running child
060109 093759 task_m_ltgpnj java.lang.RuntimeException: java.io.EOFException
060109 093759 task_m_ltgpnj     at org.apache.nutch.io.CompressedWritable.ensureInflated(CompressedWritable.java:57)
060109 093759 task_m_ltgpnj     at org.apache.nutch.protocol.Content.getContent(Content.java:124)
060109 093759 task_m_ltgpnj     at org.apache.nutch.crawl.MD5Signature.calculate(MD5Signature.java:33)
060109 093759 task_m_ltgpnj     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:62)
060109 093759 task_m_ltgpnj     at org.apache.nutch.mapred.MapRunner.run(MapRunner.java:52)
060109 093759 task_m_ltgpnj     at org.apache.nutch.mapred.MapTask.run(MapTask.java:116)
060109 093759 task_m_ltgpnj     at org.apache.nutch.mapred.TaskTracker$Child.main(TaskTracker.java:603)
060109 093759 task_m_ltgpnj Caused by: java.io.EOFException
060109 093759 task_m_ltgpnj     at java.io.DataInputStream.readFully(DataInputStream.java:268)
060109 093759 task_m_ltgpnj     at org.apache.nutch.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
060109 093759 task_m_ltgpnj     at org.apache.nutch.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
060109 093759 task_m_ltgpnj     at org.apache.nutch.io.UTF8.readChars(UTF8.java:212)
060109 093759 task_m_ltgpnj     at org.apache.nutch.io.UTF8.readString(UTF8.java:204)
060109 093759 task_m_ltgpnj     at org.apache.nutch.protocol.ContentProperties.readFields(ContentProperties.java:169)
060109 093759 task_m_ltgpnj     at org.apache.nutch.protocol.Content.readFieldsCompressed(Content.java:81)
060109 093759 task_m_ltgpnj     at org.apache.nutch.io.CompressedWritable.ensureInflated(CompressedWritable.java:54)
060109 093759 task_m_ltgpnj     ... 6 more
060109 093802 task_m_txrnu3 done; removing files.
060109 093802 Server connection on port 50050 from 127.0.0.2: exiting
060109 093805 task_m_ltgpnj done; removing files.
060109 093805 Lost connection to JobTracker [crawler-d-03.internal.wavefire.ca/127.0.0.2:8050]. ex=java.lang.NullPointerException Retrying...
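
For reference, the "Caused by" part of the trace above bottoms out in java.io.DataInputStream.readFully, which throws EOFException when the stream ends before the requested number of bytes has been read. Below is a minimal sketch of that behaviour using only the JDK, not Nutch code. The class and method names (TruncatedRecordDemo, writeRecord, readRecord, failsWithEof) and the 2-byte length prefix are hypothetical, and whether the segment data here is actually truncated or corrupted is only an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.Arrays;

public class TruncatedRecordDemo {

    // Hypothetical record layout: a 2-byte length followed by that many UTF-8 bytes.
    static byte[] writeRecord(String value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        byte[] payload = value.getBytes("UTF-8");
        out.writeShort(payload.length);
        out.write(payload);
        out.flush();
        return bytes.toByteArray();
    }

    // Reads one record back; readFully throws EOFException if the stream
    // ends before `length` bytes are available.
    static String readRecord(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int length = in.readUnsignedShort();
        byte[] payload = new byte[length];
        in.readFully(payload);
        return new String(payload, "UTF-8");
    }

    // Returns true if reading the record ends in an EOFException.
    static boolean failsWithEof(byte[] data) throws IOException {
        try {
            readRecord(data);
            return false;
        } catch (EOFException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] full = writeRecord("Content-Type");
        // Simulate a record whose tail was lost: drop the last 4 bytes.
        byte[] truncated = Arrays.copyOf(full, full.length - 4);

        System.out.println("full record reads back as: " + readRecord(full));
        System.out.println("truncated record hits EOF: " + failsWithEof(truncated));
    }
}
```

If the real cause is similar, the length read from the stream promises more bytes than the stored content actually holds, which could mean either truncated data or a change in the serialized format between code versions.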

On a different segment we got this instead:
Exception in thread "main" java.io.IOException: No input directories specified in: NutchConf: nutch-default.xml , mapred-default.xml , /nutch-data/nutch/tmp/nutch/mapred/local/jobTracker/job_tn7u97.xml , nutch-site.xml
       at org.apache.nutch.ipc.Client.call(Client.java:294)
       at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
       at $Proxy0.submitJob(Unknown Source)
       at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
       at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
       at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:95)
       at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:113)

(I think you usually get this error when you don't pass the right filenames as arguments, but that is definitely not the case here.)


These are all tasks on segments which worked fine before we changed the source code (we had previously been working with the source from about the beginning of December). It's also not a permissions issue, as it all worked fine previously. The only things that have changed are the updated code and the number of map/reduce tasks in the config. (Side note: what is the best number of tasks to use for each? We have a set of 2 machines that work together to crawl, and a set of 3 machines that work together to parse/index.)

Any help would be much appreciated, as otherwise I am doomed. Thanks ahead of time.

-Matt Zytaruk







_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
