[
https://issues.apache.org/jira/browse/NUTCH-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16490721#comment-16490721
]
ASF GitHub Bot commented on NUTCH-2584:
---------------------------------------
sebastian-nagel opened a new pull request #336: NUTCH-2584 Upgrade parse-tika
to use Tika 1.18
URL: https://github.com/apache/nutch/pull/336
(includes patch contributed by Ralf for NUTCH-2583)
In addition to the upgrade,
- use Tika parser (instead of nekohtml) to get the DOM tree of test documents
- fix HTMLMetaProcessor to extract no-cache and base-href attributes from the
DOM tree as modified by Tika
> Upgrade parse-tika to use Tika 1.18
> -----------------------------------
>
> Key: NUTCH-2584
> URL: https://issues.apache.org/jira/browse/NUTCH-2584
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.14
> Reporter: Sebastian Nagel
> Priority: Minor
> Fix For: 1.15
>
>
> Tika 1.18 has been released, and NUTCH-2583 includes an upgrade of tika-core.
> See
> [howto_upgrade_tika|https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/howto_upgrade_tika.txt].
>