[ https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma reopened NUTCH-1259:
----------------------------------
Assignee: Julien Nioche (was: Markus Jelsma)
Hey Julien, there's something wrong with this commit. We're now seeing NPEs in
the Fetcher without a stack trace. The fetcher doesn't die, but the generated
seed list is quickly terminated and only a few records get processed instead of
millions. It looks like it's triggered when a fetch error occurs. You can
reproduce this error by injecting an unknown host, but it's likely to happen as
well when socket timeouts and related errors are thrown.
{code}
fetch of http://idonotexist.openindex.io/ failed with:
java.net.UnknownHostException: idonotexist.openindex.io
fetch of http://idonotexist.openindex.io/ failed with:
java.lang.NullPointerException
fetcher caught:java.lang.NullPointerException
{code}
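One plausible way for this pattern to arise (an assumption on my side, not something confirmed from the commit) is that the code storing the detected content type also runs on the error path, where no content object exists. The sketch below is not Nutch code: `FetchErrorSketch`, `FetchedContent`, `storeTypeUnsafe` and `storeTypeGuarded` are hypothetical names used only to illustrate the failure mode and a null guard.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are NOT Nutch classes.
class FetchErrorSketch {

    /** Minimal stand-in for fetched content; the reference is null when the fetch failed. */
    static final class FetchedContent {
        final String contentType;

        FetchedContent(String contentType) {
            this.contentType = contentType;
        }
    }

    /** Unsafe: dereferences content even on the error path, where it is null. */
    static void storeTypeUnsafe(Map<String, String> meta, FetchedContent content) {
        meta.put("Content-Type", content.contentType); // throws NPE if content == null
    }

    /** Guarded: records the type only when a response was actually received. */
    static void storeTypeGuarded(Map<String, String> meta, FetchedContent content) {
        if (content != null && content.contentType != null) {
            meta.put("Content-Type", content.contentType);
        }
    }

    public static void main(String[] args) {
        Map<String, String> meta = new HashMap<>();

        // Failed fetch (e.g. unknown host): nothing is stored and nothing is thrown.
        storeTypeGuarded(meta, null);

        // Successful fetch: the detected type is recorded.
        storeTypeGuarded(meta, new FetchedContent("text/html"));
        System.out.println(meta);

        // The unguarded variant fails on the same failed-fetch input.
        try {
            storeTypeUnsafe(meta, null);
        } catch (NullPointerException e) {
            System.out.println("unsafe variant threw NPE");
        }
    }
}
```

If the real cause is along these lines, a guard of this kind on the error path would let the fetcher keep going after an UnknownHostException instead of hitting a second exception.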
Can you look at it?
> Store detected content type in crawldatum metadata
> --------------------------------------------------
>
> Key: NUTCH-1259
> URL: https://issues.apache.org/jira/browse/NUTCH-1259
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.4
> Reporter: Markus Jelsma
> Assignee: Julien Nioche
> Fix For: 1.5
>
> Attachments: NUTCH-1259-1.5-1.patch
>
>
> The MIME type detected by Tika's detect() API is never added to a Parse's
> ContentMetaData or ParseMetaData. Because of this, bad Content-Types will end
> up in the documents.