Add back parse-html
-------------------

                 Key: NUTCH-869
                 URL: https://issues.apache.org/jira/browse/NUTCH-869
             Project: Nutch
          Issue Type: Improvement
          Components: parser
    Affects Versions: 2.0, nutchbase
            Reporter: Andrzej Bialecki 
            Assignee: Andrzej Bialecki 


We need to add back parse-html. Tika 0.7 has a few serious problems with HTML 
parsing, so it's not possible to do a quality crawl using parse-tika alone. 
The necessary improvements to Tika are on the way, so if a Tika release later 
than 0.7 passes our tests, we can remove this plugin again and use parse-tika 
alone.

