This is an automated email from the ASF dual-hosted git repository.
snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git.
from ff800c5 Merge pull request #705 from sebastian-nagel/NUTCH-2867
new b0cbea5 NUTCH-2891 Upgrade to Tika 2.1.0 - upgrade Nutch core and the
plugins parse-tika and language-identifier - parse-tika uses
"tika-parsers-standard-package" (no extended and scientific parsers) -
disable Tesseract OCR in tika-config.xml
new ad61dd1 NUTCH-2891 Upgrade to Tika 2.1.0 - re-enable
language-identifier test
new 621c884 NUTCH-2891 Upgrade to Tika 2.1.0 - remove commons-codec and
commons-compress from exclusions to enable parsing of
application/x-7z-compressed files
new 671f904 Merge pull request #700 from
sebastian-nagel/NUTCH-2891-tika-2.1
The 3246 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.
Summary of changes:
 conf/tika-config.xml.template                      |  16 ++-
 ivy/ivy.xml                                        |   6 +-
 src/java/org/apache/nutch/util/MimeUtil.java       |   3 +-
 src/plugin/language-identifier/ivy.xml             |   9 +-
 src/plugin/language-identifier/plugin.xml          |  11 ++
 .../nutch/analysis/lang/HTMLLanguageParser.java    |  54 +++++----
 .../analysis/lang/TestHTMLLanguageParser.java      |  77 ++++++-------
 src/plugin/parse-tika/howto_upgrade_tika.txt       |   6 +-
 src/plugin/parse-tika/ivy.xml                      |  27 ++---
 src/plugin/parse-tika/plugin.xml                   | 124 ++++++++-------------
 .../org/apache/nutch/parse/tika/TikaParser.java    |   6 +-
 .../org/apache/nutch/parse/tika/TestRTFParser.java |   4 +-
 12 files changed, 181 insertions(+), 162 deletions(-)