[ https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766832#comment-17766832 ]
Tim Allison commented on NUTCH-2937:
------------------------------------
As [~snagel] pointed out on the PR for NUTCH-2959, it looks like we have to wait
for Hadoop 3.4.0: https://issues.apache.org/jira/browse/HADOOP-18301 :(
Unless we revert the CloseShieldInputStream.wrap() call in Tika in, say, 2.9.1? Yuck...
> parse-tika: review dependency exclusions and avoid dependency conflicts in
> distributed mode
> -------------------------------------------------------------------------------------------
>
> Key: NUTCH-2937
> URL: https://issues.apache.org/jira/browse/NUTCH-2937
> Project: Nutch
> Issue Type: Bug
> Components: parser, plugin
> Affects Versions: 1.19
> Reporter: Sebastian Nagel
> Priority: Major
> Fix For: 1.20
>
>
> While testing NUTCH-2919 I've seen the following error caused by a
> conflicting dependency on commons-io:
> - 2.11.0 Nutch core
> - 2.11.0 parse-tika (excluded to avoid duplicated dependencies)
> - 2.5 provided by Hadoop
> This causes errors parsing some office and other documents (but not all), for
> example:
> {noformat}
> 2022-01-15 01:36:31,365 WARN [FetcherThread] org.apache.nutch.parse.ParseUtil: Error parsing http://kurskrun.ru/privacypolicy with org.apache.nutch.parse.tika.TikaParser
> java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: 'org.apache.commons.io.input.CloseShieldInputStream org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)'
>     at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
>     at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205)
>     at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
>     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92)
>     at org.apache.nutch.fetcher.FetcherThread.output(FetcherThread.java:715)
>     at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:431)
> Caused by: java.lang.NoSuchMethodError: 'org.apache.commons.io.input.CloseShieldInputStream org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)'
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:120)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:115)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151)
>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90)
>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34)
>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
>     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at java.base/java.lang.Thread.run(Thread.java:829)
> {noformat}
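
The NoSuchMethodError in the trace above suggests that the commons-io copy visible at runtime (2.5 from Hadoop, per the description) does not have CloseShieldInputStream.wrap(InputStream). A minimal sketch for checking this on a given classpath follows. The class name ClasspathProbe and its two helpers are hypothetical and not part of Nutch, Tika, or Hadoop; only standard Java reflection is used. The probe sees only the classloader it runs in, so to be meaningful it would need to run where the Tika parser classes are loaded.

```java
import java.io.InputStream;
import java.security.CodeSource;

// Hypothetical diagnostic helper, not part of Nutch, Tika or Hadoop.
public class ClasspathProbe {

    /** True if the class can be loaded and has a public method with this signature. */
    public static boolean hasMethod(String className, String methodName,
                                    Class<?>... paramTypes) {
        try {
            // initialize=false: only locate the class, do not run static initializers
            Class<?> cls = Class.forName(className, false,
                    ClasspathProbe.class.getClassLoader());
            cls.getMethod(methodName, paramTypes);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
            return false;
        }
    }

    /** Jar or directory the class was loaded from, to see which copy won. */
    public static String locationOf(String className) {
        try {
            Class<?> cls = Class.forName(className, false,
                    ClasspathProbe.class.getClassLoader());
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            return src == null ? "(bootstrap or unknown)"
                               : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Class and method taken from the NoSuchMethodError in the stack trace
        String cls = "org.apache.commons.io.input.CloseShieldInputStream";
        System.out.println("loaded from: " + locationOf(cls));
        System.out.println("wrap(InputStream) present: "
                + hasMethod(cls, "wrap", InputStream.class));
    }
}
```

If the reported location is the Hadoop-provided commons-io jar and the method is reported as absent, that would be consistent with the conflict described in this issue.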