[ 
https://issues.apache.org/jira/browse/NUTCH-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451702#comment-17451702
 ] 

Hudson commented on NUTCH-2891:
-------------------------------

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #52 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/52/])
NUTCH-2891 Upgrade to Tika 2.1.0 (snagel: 
[https://github.com/apache/nutch/commit/b0cbea575cf819526bb7eacc9c6907986243cdcb])
* (edit) 
src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/HTMLLanguageParser.java
* (edit) src/plugin/parse-tika/ivy.xml
* (edit) 
src/plugin/language-identifier/src/test/org/apache/nutch/analysis/lang/TestHTMLLanguageParser.java
* (edit) src/plugin/language-identifier/plugin.xml
* (edit) src/plugin/build.xml
* (edit) 
src/plugin/parse-tika/src/test/org/apache/nutch/parse/tika/TestRTFParser.java
* (edit) conf/tika-config.xml.template
* (edit) 
src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java
* (edit) ivy/ivy.xml
* (edit) src/plugin/parse-tika/plugin.xml
* (edit) src/plugin/language-identifier/ivy.xml
* (edit) src/java/org/apache/nutch/util/MimeUtil.java
* (edit) src/plugin/parse-tika/howto_upgrade_tika.txt
NUTCH-2891 Upgrade to Tika 2.1.0 (snagel: 
[https://github.com/apache/nutch/commit/ad61dd1c80156934e1501efd8e077628c72f2acf])
* (edit) src/plugin/build.xml
NUTCH-2891 Upgrade to Tika 2.1.0 (snagel: 
[https://github.com/apache/nutch/commit/621c8848d6a5d6d4f62067eb618b97f94e0d3db3])
* (edit) src/plugin/parse-tika/plugin.xml
* (edit) src/plugin/parse-tika/ivy.xml


> Upgrade to Tika 2.1
> -------------------
>
>                 Key: NUTCH-2891
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2891
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser, plugin
>    Affects Versions: 1.18
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.19
>
>
> Tika 2 has already seen its second release 
> ([2.1.0|https://tika.apache.org/2.1.0/index.html]). Points to consider, based on the [2.0 
> release notes|https://archive.apache.org/dist/tika/2.0.0/CHANGES-2.0.0.txt] 
> and the [migration 
> guide|https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0]:
> * Tika 2 is more modular, which should allow us to build a smaller parse-tika 
> plugin (66 MiB in the 1.18 binary package) by dropping rarely used parsers. 
> Users should still be able to include them if they build Nutch from source.
> * The language-identifier plugin needs to be upgraded as well (in addition to 
> Nutch core and the parse-tika plugin). This would include or overlap with 
> NUTCH-2449.
> * To keep the PDF parser from timing out, we probably want to disable OCR 
> by default or, at least, provide a configuration snippet for this purpose.
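
For illustration, one way to switch off OCR in a Tika 2 configuration file is to exclude the Tesseract OCR parser from the default parser. This is a minimal sketch based on Tika's documented configuration format, not necessarily what the commits above put into conf/tika-config.xml.template; a real configuration may need further parser entries.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: assumes the default parser set is otherwise wanted as-is. -->
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- Keep all default parsers except the Tesseract-based OCR parser. -->
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
```

With this exclusion, embedded images in PDFs are not passed to Tesseract, which is typically the slow step when OCR is installed on the host.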



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
