[ https://issues.apache.org/jira/browse/NUTCH-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17939173#comment-17939173 ]

ASF GitHub Bot commented on NUTCH-3110:
---------------------------------------

sebastian-nagel opened a new pull request, #850:
URL: https://github.com/apache/nutch/pull/850

   - Upgrade to shaded Tika packages 3.1.0.0 provided by Tim Allison.
     The shaded packages are required to avoid version conflicts when running
     in distributed mode caused by incompatible versions of the commons-io jar
     shipped with Hadoop and required by Tika, cf. NUTCH-2959.
   - Add "text/javascript" as MIME type supported by "parse-js".
     Note: This fixes the parse-js unit tests. Tika 3.1.0 identifies the
     Javascript test document as "text/javascript" instead of
     "application/javascript".

   Todo:
   - [ ] fix unit test o.a.n.parse.tika.TestDOMContentUtils : duplicated outlinks
   - [ ] fix unit test o.a.n.parse.tika.TestHtmlParser : parsing a UTF-16 encoded
     HTML fails partially (no title, no keywords).
     Note: it might be that there are two BOMs in the test document.


> Upgrade to Tika 3.1.0
> ---------------------
>
>                 Key: NUTCH-3110
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3110
>             Project: Nutch
>          Issue Type: Improvement
>          Components: dependency, parse-filter, parser
>    Affects Versions: 1.20
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.21
>
>
> Upgrade either to the default Tika 3.1.0 or the shaded packages 3.1.0.0
> provided by [~tallison], see discussion in [PR
> #849|https://github.com/apache/nutch/pull/849].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)