Add back parse-html
-------------------
Key: NUTCH-869
URL: https://issues.apache.org/jira/browse/NUTCH-869
Project: Nutch
Issue Type: Improvement
Components: parser
Affects Versions: 2.0, nutchbase
Reporter: Andrzej Bialecki
Assignee: Andrzej Bialecki
We need to add back parse-html. HTML parsing in Tika 0.7 has a few serious
problems, so a quality crawl is not possible using parse-tika alone. The
necessary improvements to Tika are on the way; once a Tika release later than
0.7 passes our tests, we can remove this plugin again and use parse-tika
alone.