[ https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13708422#comment-13708422 ]
Markus Jelsma commented on NUTCH-1084:
--------------------------------------
I'm not sure how to fix this issue in Nutch's source (if that is possible at
all), and the relevant threads on the Hadoop list remain unanswered, but you can
work around the problem by adding the job file to Hadoop's classpath.
In conf/hadoop-env.sh:
{code}
export HADOOP_CLASSPATH=apache-nutch-1.8.job
{code}
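A relative classpath entry is resolved against the working directory of the JVM, so an absolute path may be more robust. Here is a minimal sketch, assuming a hypothetical install location (/opt/nutch is only a placeholder, adjust it to wherever the job file actually lives) and that any existing HADOOP_CLASSPATH value should be kept:

```shell
# Assumption: NUTCH_JOB is a helper variable introduced here for illustration;
# /opt/nutch is a placeholder path, not something taken from this issue.
NUTCH_JOB="${NUTCH_JOB:-/opt/nutch/apache-nutch-1.8.job}"

# Append the job file to HADOOP_CLASSPATH, preserving any existing entries.
# The ${VAR:+...} form avoids a leading ':' when the variable is unset or empty.
export HADOOP_CLASSPATH="${HADOOP_CLASSPATH:+$HADOOP_CLASSPATH:}$NUTCH_JOB"
```

With this in place the job file is always the last classpath entry, so it does not shadow anything Hadoop already puts on the classpath.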
Cheers
> ReadDB url throws exception
> ---------------------------
>
> Key: NUTCH-1084
> URL: https://issues.apache.org/jira/browse/NUTCH-1084
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.3
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.9
>
>
> Readdb -url suffers from two problems:
> 1. it trips over the _SUCCESS file generated by newer Hadoop versions
> 2. it throws "can't find class: org.apache.nutch.protocol.ProtocolStatus" (???)
> The first problem can be remedied by not allowing the injector or updater to
> write the _SUCCESS file. So far, that is the solution implemented for
> similar issues. I have not managed to make the Hadoop readers simply
> skip the file.
> The second issue seems a bit strange and did not happen on a local checkout.
> I'm not yet sure whether this is a Hadoop issue or something corrupt in
> the CrawlDB. Here's the stack trace:
> {code}
> Exception in thread "main" java.io.IOException: can't find class: org.apache.nutch.protocol.ProtocolStatus because org.apache.nutch.protocol.ProtocolStatus
> at org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:204)
> at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:146)
> at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:278)
> at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
> at org.apache.hadoop.io.MapFile$Reader.get(MapFile.java:524)
> at org.apache.hadoop.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:105)
> at org.apache.nutch.crawl.CrawlDbReader.get(CrawlDbReader.java:383)
> at org.apache.nutch.crawl.CrawlDbReader.readUrl(CrawlDbReader.java:389)
> at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:514)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> {code}