[ https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125715#comment-13125715 ]
Markus Jelsma commented on NUTCH-1084:
--------------------------------------
I've checked the write and read methods and it all adds up. It also does not
happen when running locally, which makes me think it has to do with the
MapWritable holding the metadata.
> ReadDB url throws exception
> ---------------------------
>
> Key: NUTCH-1084
> URL: https://issues.apache.org/jira/browse/NUTCH-1084
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.3
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.5
>
>
> Readdb -url suffers from two problems:
> 1. it trips over the _SUCCESS file generated by newer Hadoop versions
> 2. it throws "can't find class: org.apache.nutch.protocol.ProtocolStatus" (???)
> The first problem can be remedied by not allowing the injector or updater to
> write the _SUCCESS file. So far that is the solution implemented for
> similar issues. I have not succeeded in making the Hadoop readers simply
> skip the file.
> The second issue seems a bit strange and did not happen on a local checkout.
> I'm not yet sure whether this is a Hadoop issue or something being corrupt in
> the CrawlDB. Here's the stack trace:
> {code}
> Exception in thread "main" java.io.IOException: can't find class: org.apache.nutch.protocol.ProtocolStatus because org.apache.nutch.protocol.ProtocolStatus
>     at org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:204)
>     at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:146)
>     at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:278)
>     at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
>     at org.apache.hadoop.io.MapFile$Reader.get(MapFile.java:524)
>     at org.apache.hadoop.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:105)
>     at org.apache.nutch.crawl.CrawlDbReader.get(CrawlDbReader.java:383)
>     at org.apache.nutch.crawl.CrawlDbReader.readUrl(CrawlDbReader.java:389)
>     at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:514)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> {code}
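As a rough illustration of the first problem in the description above (having the reader skip the _SUCCESS marker instead of suppressing it on the write side), the filtering step could look something like the sketch below. This is not the Nutch or Hadoop code: it uses plain java.nio on a local temporary directory rather than the Hadoop FileSystem API, and the names PartDirLister, isHidden and listParts are made up for the example. The rule of treating names that start with "_" or "." as hidden is an assumption borrowed from the usual Hadoop convention for marker and checksum files.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PartDirLister {

    /** True for marker or hidden entries such as _SUCCESS or .crc files (assumed convention). */
    public static boolean isHidden(String name) {
        return name.startsWith("_") || name.startsWith(".");
    }

    /** Returns the entry names of dir, sorted, with hidden entries left out. */
    public static List<String> listParts(Path dir) throws IOException {
        List<String> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                String name = p.getFileName().toString();
                if (!isHidden(name)) {
                    parts.add(name);
                }
            }
        }
        Collections.sort(parts);
        return parts;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway directory that mimics a crawldb "current" folder.
        Path dir = Files.createTempDirectory("crawldb-current");
        Files.createDirectory(dir.resolve("part-00000"));
        Files.createDirectory(dir.resolve("part-00001"));
        Files.createFile(dir.resolve("_SUCCESS"));

        // Only the part directories should be handed to the readers.
        System.out.println(listParts(dir)); // prints [part-00000, part-00001]
    }
}
```

Whether a filter like this can actually be hooked into the code path used by readdb -url is left open here; the sketch only shows the selection rule.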