[ https://issues.apache.org/jira/browse/NUTCH-869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Nioche resolved NUTCH-869.
---------------------------------
Resolution: Fixed
Nutchbase : Committed revision 982184
1.2 : Committed revision 982185
trunk (2.0) : Committed revision 982197
> Add back parse-html
> -------------------
>
> Key: NUTCH-869
> URL: https://issues.apache.org/jira/browse/NUTCH-869
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.2, 2.0, nutchbase
> Reporter: Andrzej Bialecki
> Assignee: Julien Nioche
> Fix For: 1.2, 2.0, nutchbase
>
>
> We need to add back parse-html. There are a few serious problems with HTML
> parsing in Tika 0.7, so it's not possible to do a quality crawl using
> parse-tika alone. The necessary improvements to Tika are on the way, so if a
> future version of Tika > 0.7 has a chance of passing our tests we can again
> remove this plugin and use parse-tika alone.