[
https://issues.apache.org/jira/browse/NUTCH-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13594411#comment-13594411
]
Tejas Patil commented on NUTCH-1454:
A few observations about this issue:
1. Nutch detects the correct MIME type for the document. The error occurs
later, while parsing the content.
2. Even when running tika-app standalone (i.e. not via Nutch), not a single
chm file was parsed (I tried 10-15 different chm files of varying sizes). I
added this observation to a [relevant JIRA issue in the Tika
project|https://issues.apache.org/jira/browse/TIKA-245?focusedCommentId=13594074&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13594074],
but there has been no reply so far.
3. People in the Tika community have observed that the chm4j library performs
better than Tika's own chm parser implementation. Anyone who urgently needs to
crawl and parse chm documents can use this library. Ideally we would use it in
Nutch, but since only a very small percentage of users need to parse chm files,
we should refrain from doing so.
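For anyone who wants to repeat the standalone check from observation 2, it could be scripted roughly as below. This is only a sketch: it assumes a tika-app jar is available locally and that its `--text` option prints the extracted plain text to stdout. The function names, the jar path and the directory are placeholders made up for illustration; they are not part of Nutch or Tika.

```python
import subprocess
from pathlib import Path


def tika_text_command(tika_jar, document):
    """Build the tika-app command line that extracts plain text from one file.

    Assumption: the jar is run as `java -jar <jar> --text <file>`.
    """
    return ["java", "-jar", str(tika_jar), "--text", str(document)]


def is_empty_extraction(output):
    """Treat whitespace-only output as 'no content extracted'."""
    return not output.strip()


def check_chm_files(tika_jar, directory):
    """Run tika-app on every .chm file in `directory`.

    Returns a dict mapping each file path to True if tika-app exited
    successfully and printed some non-whitespace text, else False.
    """
    results = {}
    for chm in sorted(Path(directory).glob("*.chm")):
        proc = subprocess.run(
            tika_text_command(tika_jar, chm),
            capture_output=True,
            text=True,
        )
        results[chm] = proc.returncode == 0 and not is_empty_extraction(proc.stdout)
    return results
```

Calling `check_chm_files("tika-app.jar", "chm-samples")` (both names hypothetical) would then list which files yielded any text at all, which makes it easy to compare Tika versions or an alternative library on the same sample set.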
> parsing chm failed
> --
>
> Key: NUTCH-1454
> URL: https://issues.apache.org/jira/browse/NUTCH-1454
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.5.1
> Reporter: Sebastian Nagel
> Priority: Minor
> Fix For: 1.7
>
>
> (reported by Jan Riewe, see
> http://lucene.472066.n3.nabble.com/CHM-Files-and-Tika-td3999735.html)
> Nutch fails to parse chm files with
> {quote}
> ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type
> application/vnd.ms-htmlhelp
> {quote}
> Tested with chm test files from Tika:
> {code}
> % bin/nutch parsechecker file:/.../tika/trunk/tika-parsers/src/test/resources/test-documents/testChm.chm
> {code}
> Tika parses this document (but does not extract any content).