[
https://issues.apache.org/jira/browse/NUTCH-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18051379#comment-18051379
]
ASF GitHub Bot commented on NUTCH-3110:
---------------------------------------
lewismc opened a new pull request, #887:
URL: https://github.com/apache/nutch/pull/887
This PR attempts to address
[NUTCH-3110](https://issues.apache.org/jira/browse/NUTCH-3110) and, in the
process, to supersede https://github.com/apache/nutch/pull/850.
Essentially it upgrades Apache Tika from the shaded artifacts to the
official Tika 3.2.3 release, addressing compatibility issues and restoring full
functionality. Some noteworthy proposals:
* Both plugins (language-identifier & parse-tika) exclude slf4j-api to
prevent class loader conflicts (NUTCH-3108)
* Duplicate outlinks: Changed `HashMap` to `LinkedHashMap` in
`DOMContentUtils.java` to preserve link insertion order while deduplicating.
* UTF-16 encoding test: Fixed double BOM issue in `TestHtmlParser.java`
where Java's UTF-16 encoder was adding a second BOM.
* Boilerpipe support: Restored boilerpipe content extraction using the new
`tika-handler-boilerpipe` module.
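
To illustrate the duplicate-outlinks item above, here is a minimal sketch. It is not the actual `DOMContentUtils.java` code: the class `OutlinkDedupSketch`, the method `dedupe`, and the example URLs are hypothetical. It only shows why a `LinkedHashMap` keyed by URL drops duplicates while keeping first-seen order, which a plain `HashMap` does not guarantee.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and data are illustrative, not taken from Nutch.
class OutlinkDedupSketch {

  /**
   * Deduplicates links by URL. Each entry is {url, anchorText}.
   * The first anchor text seen for a URL is kept, and iteration order
   * of the result matches the order in which URLs first appeared.
   */
  static Map<String, String> dedupe(List<String[]> links) {
    Map<String, String> seen = new LinkedHashMap<>();
    for (String[] link : links) {
      // putIfAbsent leaves an existing entry (and its position) untouched
      seen.putIfAbsent(link[0], link[1]);
    }
    return seen;
  }

  public static void main(String[] args) {
    List<String[]> links = new ArrayList<>();
    links.add(new String[] {"https://example.org/b", "B"});
    links.add(new String[] {"https://example.org/a", "A"});
    links.add(new String[] {"https://example.org/b", "B again"});

    // prints [https://example.org/b, https://example.org/a]
    System.out.println(dedupe(links).keySet());
  }
}
```

With a `HashMap` the same code would still deduplicate, but the iteration order would depend on hash codes rather than on document order.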
Additionally, a set of new tests will assist in future Tika upgrades:
- TestBoilerpipeExtraction - Boilerpipe integration tests
- TestLinkExtractionEdgeCases - Link extraction behavior tests
- TestEncodingDetection - Charset detection tests
- TestMetadataExtraction - HTML metadata extraction tests
- TestParserFailureHandling - Error handling/graceful degradation tests
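
For readers unfamiliar with the UTF-16 double BOM mentioned in the list of changes, the sketch below shows one way it can arise in plain Java. It is an assumption about the general mechanism, not a copy of the fix in `TestHtmlParser.java`; the class `Utf16BomSketch` and its methods are hypothetical. Java's `UTF-16` charset writes a big-endian BOM when encoding, so text that already starts with U+FEFF ends up with two.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not the actual test code from Nutch.
class Utf16BomSketch {

  static final char BOM = '\uFEFF';

  /** Encodes with the UTF-16 charset, which prepends its own BOM (FE FF). */
  static byte[] encodeOnce(String text) {
    return text.getBytes(StandardCharsets.UTF_16);
  }

  /** Prepends U+FEFF manually and then encodes with UTF-16: two BOMs result. */
  static byte[] encodeWithManualBom(String text) {
    return (BOM + text).getBytes(StandardCharsets.UTF_16);
  }

  /** Prepends U+FEFF manually but encodes with UTF-16BE, which adds no BOM. */
  static byte[] encodeManualBomBigEndian(String text) {
    return (BOM + text).getBytes(StandardCharsets.UTF_16BE);
  }

  /** Counts consecutive FE FF byte pairs at the start of the array. */
  static int leadingBoms(byte[] bytes) {
    int count = 0;
    for (int i = 0; i + 1 < bytes.length; i += 2) {
      if ((bytes[i] & 0xFF) == 0xFE && (bytes[i + 1] & 0xFF) == 0xFF) {
        count++;
      } else {
        break;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    System.out.println(leadingBoms(encodeOnce("hi")));               // 1
    System.out.println(leadingBoms(encodeWithManualBom("hi")));      // 2
    System.out.println(leadingBoms(encodeManualBomBigEndian("hi"))); // 1
  }
}
```

Either letting the `UTF-16` charset write the BOM, or writing it manually and encoding with `UTF_16BE` or `UTF_16LE`, yields a single BOM.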
> Upgrade to Tika 3.2.3
> ---------------------
>
> Key: NUTCH-3110
> URL: https://issues.apache.org/jira/browse/NUTCH-3110
> Project: Nutch
> Issue Type: Improvement
> Components: dependency, parse-filter, parser
> Affects Versions: 1.20
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.22
>
>
> Upgrade either to the default Tika 3.1.0 or the shaded packages 3.1.0.0
> provided by [~tallison], see discussion in [PR
> #849|https://github.com/apache/nutch/pull/849].
--
This message was sent by Atlassian Jira
(v8.20.10#820010)