[ 
https://issues.apache.org/jira/browse/NUTCH-869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894525#action_12894525
 ] 

Julien Nioche commented on NUTCH-869:
-------------------------------------

+1

> Add back parse-html
> -------------------
>
>                 Key: NUTCH-869
>                 URL: https://issues.apache.org/jira/browse/NUTCH-869
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.0, nutchbase
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>
> We need to add back parse-html. There are a few serious problems with HTML 
> parsing in Tika 0.7, so it's not possible to do a quality crawl using 
> parse-tika alone. The necessary improvements to Tika are on the way, so once 
> a Tika release later than 0.7 passes our tests, we can remove this plugin 
> again and use parse-tika alone.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
