[ https://issues.apache.org/jira/browse/NUTCH-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451702#comment-17451702 ]

Hudson commented on NUTCH-2891:
-------------------------------

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #52 (See [https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/52/])
NUTCH-2891 Upgrade to Tika 2.1.0 (snagel: [https://github.com/apache/nutch/commit/b0cbea575cf819526bb7eacc9c6907986243cdcb])
* (edit) src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/HTMLLanguageParser.java
* (edit) src/plugin/parse-tika/ivy.xml
* (edit) src/plugin/language-identifier/src/test/org/apache/nutch/analysis/lang/TestHTMLLanguageParser.java
* (edit) src/plugin/language-identifier/plugin.xml
* (edit) src/plugin/build.xml
* (edit) src/plugin/parse-tika/src/test/org/apache/nutch/parse/tika/TestRTFParser.java
* (edit) conf/tika-config.xml.template
* (edit) src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java
* (edit) ivy/ivy.xml
* (edit) src/plugin/parse-tika/plugin.xml
* (edit) src/plugin/language-identifier/ivy.xml
* (edit) src/java/org/apache/nutch/util/MimeUtil.java
* (edit) src/plugin/parse-tika/howto_upgrade_tika.txt
NUTCH-2891 Upgrade to Tika 2.1.0 (snagel: [https://github.com/apache/nutch/commit/ad61dd1c80156934e1501efd8e077628c72f2acf])
* (edit) src/plugin/build.xml
NUTCH-2891 Upgrade to Tika 2.1.0 (snagel: [https://github.com/apache/nutch/commit/621c8848d6a5d6d4f62067eb618b97f94e0d3db3])
* (edit) src/plugin/parse-tika/plugin.xml
* (edit) src/plugin/parse-tika/ivy.xml


> Upgrade to Tika 2.1
> -------------------
>
>                 Key: NUTCH-2891
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2891
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser, plugin
>    Affects Versions: 1.18
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.19
>
>
> There's already the second release of Tika 2
> ([2.1.0|https://tika.apache.org/2.1.0/index.html]).
> Following the [2.0 release notes|https://archive.apache.org/dist/tika/2.0.0/CHANGES-2.0.0.txt]
> and the [migration guide|https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0]:
> * Tika 2 is more modular, which should allow us to build a smaller parse-tika
> (66 MiB in the 1.18 binary package) by dropping rarely used parsers - but
> users should be able to include them if they build Nutch from the sources.
> * the language-identifier plugin needs to be upgraded as well (in addition to
> Nutch core and the parse-tika plugin). This would include or overlap with
> NUTCH-2449.
> * to keep the PDF parser from timing out, we probably want to disable OCR
> by default, or at least provide a configuration snippet for this purpose



--
This message was sent by Atlassian Jira
(v8.20.1#820001)