This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git.

    from ff800c5  Merge pull request #705 from sebastian-nagel/NUTCH-2867
     new b0cbea5  NUTCH-2891 Upgrade to Tika 2.1.0
                  - upgrade Nutch core and the plugins parse-tika and language-identifier
                  - parse-tika uses "tika-parsers-standard-package" (no extended and scientific parsers)
                  - disable Tesseract OCR in tika-config.xml
     new ad61dd1  NUTCH-2891 Upgrade to Tika 2.1.0
                  - re-enable language-identifier test
     new 621c884  NUTCH-2891 Upgrade to Tika 2.1.0
                  - remove commons-codec and commons-compress from exclusions
                    to enable parsing of application/x-7z-compressed files
     new 671f904  Merge pull request #700 from sebastian-nagel/NUTCH-2891-tika-2.1

The 4 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.

Summary of changes:
 conf/tika-config.xml.template                      |  16 ++-
 ivy/ivy.xml                                        |   6 +-
 src/java/org/apache/nutch/util/MimeUtil.java       |   3 +-
 src/plugin/language-identifier/ivy.xml             |   9 +-
 src/plugin/language-identifier/plugin.xml          |  11 ++
 .../nutch/analysis/lang/HTMLLanguageParser.java    |  54 +++++----
 .../analysis/lang/TestHTMLLanguageParser.java      |  77 ++++++-------
 src/plugin/parse-tika/howto_upgrade_tika.txt       |   6 +-
 src/plugin/parse-tika/ivy.xml                      |  27 ++---
 src/plugin/parse-tika/plugin.xml                   | 124 ++++++++-------------
 .../org/apache/nutch/parse/tika/TikaParser.java    |   6 +-
 .../org/apache/nutch/parse/tika/TestRTFParser.java |   4 +-
 12 files changed, 181 insertions(+), 162 deletions(-)