[ https://issues.apache.org/jira/browse/NUTCH-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13594411#comment-13594411 ]
Tejas Patil commented on NUTCH-1454:
------------------------------------

A few observations about this issue:

1. Nutch detects the correct MIME type for the document. The error occurs later, while parsing the content.
2. Even when running tika-app standalone (i.e. not via Nutch), I could not get a single chm file parsed (I tried 10-15 different chm files of varying sizes). I added this observation to a [relevant JIRA issue in the Tika project|https://issues.apache.org/jira/browse/TIKA-245?focusedCommentId=13594074&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13594074], but there has been no reply so far.
3. People in the Tika community have observed that the chm4j library performs better than Tika's own chm parser implementation. Anyone in dire need of crawling and parsing chm documents can leverage this library. Ideally we would use this library in Nutch, but since only a very small percentage of users need to parse chm, we should refrain from doing so.

> parsing chm failed
> ------------------
>
>                 Key: NUTCH-1454
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1454
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.5.1
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.7
>
>
> (reported by Jan Riewe, see
> http://lucene.472066.n3.nabble.com/CHM-Files-and-Tika-td3999735.html)
> Nutch fails to parse chm files with
> {quote}
> ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type
> application/vnd.ms-htmlhelp
> {quote}
> Tested with chm test files from Tika:
> {code}
> % bin/nutch parsechecker
> file:/.../tika/trunk/tika-parsers/src/test/resources/test-documents/testChm.chm
> {code}
> Tika parses this document (but does not extract any content).
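
Observation 1 above separates MIME detection from parsing. When triaging a failing file, it can help to first confirm, independently of Nutch and Tika, that the file really is a CHM archive. The following is a minimal sketch, assuming the CHM format's `ITSF` signature in the first four bytes of the file. The helper name `looks_like_chm` is hypothetical and is not part of Nutch or Tika.

```python
from pathlib import Path

# Assumption: a CHM file header starts with the 4-byte signature "ITSF".
CHM_SIGNATURE = b"ITSF"


def looks_like_chm(path):
    """Return True if the file at `path` starts with the CHM signature.

    Hypothetical helper for triage only; it does not parse the archive.
    """
    with Path(path).open("rb") as handle:
        return handle.read(len(CHM_SIGNATURE)) == CHM_SIGNATURE
```

If this check passes but the parser still fails, the problem is more likely in the parsing step than in the file or its detected type, which matches the observations above.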