[ 
https://issues.apache.org/jira/browse/NUTCH-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13594411#comment-13594411
 ] 

Tejas Patil commented on NUTCH-1454:
------------------------------------

A few observations about this issue:
1. Nutch detects the correct MIME type for the document. The error occurs 
later, while parsing the content. 
2. Even when running tika-app standalone (i.e. not via Nutch), not a single 
chm file was parsed (I tried 10-15 different chm files of various sizes). I 
added this observation to a [relevant JIRA issue in the Tika 
project|https://issues.apache.org/jira/browse/TIKA-245?focusedCommentId=13594074&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13594074]
 but have received no reply so far. 
3. People in the Tika community have observed that the chm4j library performs 
better than Tika's own chm parser implementation. Anyone who urgently needs to 
crawl and parse chm documents can use this library. Ideally we would use it in 
Nutch, but since only a very small percentage of users need to parse chm 
files, we should refrain from doing so.
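
When repeating the test in observation 2, it may help to first rule out 
malformed input, so that a failure can be attributed to the parser rather than 
to the files. The following is only a minimal sketch, assuming the test files 
are expected to carry the usual "ITSF" signature at offset 0 of a CHM file. 
The helper names looks_like_chm and triage are hypothetical and are not part 
of Nutch or Tika.

```python
import os

# Assumption: a well-formed CHM file starts with the 4-byte "ITSF" signature.
CHM_MAGIC = b"ITSF"


def looks_like_chm(path):
    """Return True if the file at `path` starts with the CHM signature."""
    with open(path, "rb") as f:
        return f.read(len(CHM_MAGIC)) == CHM_MAGIC


def triage(directory):
    """Split the *.chm files in `directory` into (plausible, suspicious) names.

    Files without a .chm extension are ignored. The check only inspects the
    first bytes; it says nothing about whether a parser can read the file.
    """
    plausible, suspicious = [], []
    for name in sorted(os.listdir(directory)):
        if not name.lower().endswith(".chm"):
            continue
        path = os.path.join(directory, name)
        if looks_like_chm(path):
            plausible.append(name)
        else:
            suspicious.append(name)
    return plausible, suspicious
```

Calling triage on the directory holding the sample files returns two lists of 
file names. Files in the first list at least look like CHM, so a parse failure 
on them points at the parser rather than at the input.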
                
> parsing chm failed
> ------------------
>
>                 Key: NUTCH-1454
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1454
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.5.1
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.7
>
>
> (reported by Jan Riewe, see 
> http://lucene.472066.n3.nabble.com/CHM-Files-and-Tika-td3999735.html)
> Nutch fails to parse chm files with
> {quote}
>  ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type 
> application/vnd.ms-htmlhelp
> {quote}
> Tested with chm test files from Tika:
> {code}
>  % bin/nutch parsechecker 
> file:/.../tika/trunk/tika-parsers/src/test/resources/test-documents/testChm.chm
> {code}
> Tika parses this document (but does not extract any content).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
