[ https://issues.apache.org/jira/browse/NUTCH-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel resolved NUTCH-2891.
------------------------------------
Resolution: Implemented
> Upgrade to Tika 2.1
> -------------------
>
> Key: NUTCH-2891
> URL: https://issues.apache.org/jira/browse/NUTCH-2891
> Project: Nutch
> Issue Type: Improvement
> Components: parser, plugin
> Affects Versions: 1.18
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.19
>
>
> Tika 2 already has its second release
> ([2.1.0|https://tika.apache.org/2.1.0/index.html]). Following the [2.0
> release notes|https://archive.apache.org/dist/tika/2.0.0/CHANGES-2.0.0.txt]
> and the [migration
> guide|https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0]:
> * Tika 2 is more modular, which should allow us to build a smaller parse-tika
> (66 MiB in the 1.18 binary package) by dropping rarely used parsers. Users
> should still be able to include them if they build Nutch from source.
> * the language-identifier plugin needs to be upgraded as well (in addition to
> Nutch core and the parse-tika plugin). This would include or overlap with
> NUTCH-2449.
> * to keep the PDF parser from timing out, we probably want to disable OCR by
> default, or at least provide a configuration snippet for this purpose.
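
For the last point, a minimal sketch of what such a configuration snippet
could look like, assuming a Tika configuration file (commonly named
tika-config.xml) is passed to the parser. It uses Tika's parser-exclude
mechanism to drop the Tesseract OCR parser from the default parser set. How
parse-tika would load this file is not settled by this issue and is left
open here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: excludes the Tesseract-based OCR parser from Tika's
     default composite parser so no OCR is attempted. -->
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
```

This only removes the OCR parser itself. Whether the PDF parser needs
additional settings of its own to avoid timeouts would have to be verified
against the Tika version actually bundled.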