[
https://issues.apache.org/jira/browse/NUTCH-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359127#comment-14359127
]
Giuseppe Totaro commented on NUTCH-1963:
----------------------------------------
Thanks a lot [~lewismc]. We can solve this problem by calling
{{setLongFileMode(TarArchiveOutputStream.LONGFILE_GNU)}} on the
{{TarArchiveOutputStream}} ([Apache Commons
Compress|http://commons.apache.org/proper/commons-compress/apidocs/org/apache/commons/compress/archivers/tar/TarArchiveOutputStream.html])
before adding entries, so that file names longer than 100 bytes are accepted.
I will update the patch soon in
[NUTCH-1959|https://issues.apache.org/jira/browse/NUTCH-1959].
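For illustration, here is a minimal sketch of the idea, not the actual patch. It assumes Apache Commons Compress is on the classpath; the class name {{LongNameTarSketch}}, the helper {{writeTarGz}} and the sample file name are hypothetical and are not taken from {{CommonCrawlDataDumper}}.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorOutputStream;

// Hypothetical sketch: writes a single entry into an in-memory .tar.gz.
public class LongNameTarSketch {

    /** Writes one entry into a gzip-compressed tar and returns the archive bytes. */
    static byte[] writeTarGz(String entryName, byte[] content) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GzipCompressorOutputStream gzip = new GzipCompressorOutputStream(buffer);
             TarArchiveOutputStream tar = new TarArchiveOutputStream(gzip)) {
            // The default mode (LONGFILE_ERROR) rejects names over 100 bytes;
            // LONGFILE_GNU stores them using the GNU tar long-name extension.
            tar.setLongFileMode(TarArchiveOutputStream.LONGFILE_GNU);

            TarArchiveEntry entry = new TarArchiveEntry(entryName);
            entry.setSize(content.length); // size must be set before the entry is written
            tar.putArchiveEntry(entry);
            tar.write(content);
            tar.closeArchiveEntry();
        } // closing the tar stream finishes the archive, then the gzip stream is closed
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Build an illustrative entry name that is longer than 100 bytes.
        StringBuilder name = new StringBuilder();
        for (int i = 0; i < 12; i++) {
            name.append("long%20name");
        }
        name.append(".pdf");

        byte[] payload = "example content".getBytes(StandardCharsets.UTF_8);
        byte[] archive = writeTarGz(name.toString(), payload);
        System.out.println("Entry name length: " + name.length()
                + ", archive size: " + archive.length + " bytes");
    }
}
```

With the long file mode left at its default, the same {{putArchiveEntry}} call is the one that fails in the stack trace quoted below.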
Thank you,
Giuseppe
> CommonsCrawlDataDumper is too long ( > 100 bytes) when -gzip option invoked
> ---------------------------------------------------------------------------
>
> Key: NUTCH-1963
> URL: https://issues.apache.org/jira/browse/NUTCH-1963
> Project: Nutch
> Issue Type: Bug
> Components: commoncrawl
> Affects Versions: 1.10
> Reporter: Lewis John McGibbney
> Fix For: 1.10
>
>
> When invoking the commoncrawldump tool with the *-gzip* option and
> *-mimetype application/pdf*, I get the following stack trace, which causes
> the task to fail:
> {code}
> java.lang.RuntimeException: file name
> 'Socio-Economic%20Impact%20of%20Ebola%20on%20Households%20in%20Liberia%20Nov%2019%20(final,%20revised).pdf'
> is too long ( > 100 bytes)
> at
> org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.handleLongName(TarArchiveOutputStream.java:674)
> at
> org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.putArchiveEntry(TarArchiveOutputStream.java:275)
> at
> org.apache.nutch.tools.CommonCrawlDataDumper.dump(CommonCrawlDataDumper.java:400)
> at
> org.apache.nutch.tools.CommonCrawlDataDumper.main(CommonCrawlDataDumper.java:236)
> {code}
> The workaround is to omit the *-gzip* option and defer compression to a
> later task. However, this is only a workaround and not a solution.
> We need to fix this so that the tool works as designed and required.