Giuseppe Totaro created NUTCH-1961:
--------------------------------------
Summary: Provide multipart compression of Common Crawl data
Key: NUTCH-1961
URL: https://issues.apache.org/jira/browse/NUTCH-1961
Project: Nutch
Issue Type: Wish
Affects Versions: 1.9
Reporter: Giuseppe Totaro
Priority: Minor
Using the {{-gzip}} option in {{CommonCrawlDataDumper}}, users can compress
data and create a TAR archive (using [Apache Commons
Compress|http://commons.apache.org/proper/commons-compress]).
We could also offer the option of creating a multipart compressed archive
based on a size threshold. I ran some tests with a {{CountingOutputStream}}
placed "in the middle" of the stream chain to count the bytes written, but
this requires flushing the output streams at each iteration.
Furthermore, _gzip_ does not support multipart compression (we can split the
archive into multiple {{.tar.gz}} files, but they have to be decompressed
individually), whereas _zip_ does (even though this feature is not yet
supported in Apache Commons Compress).
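To make the idea concrete, here is a minimal sketch of threshold-based splitting, using only the JDK. It is not the {{CommonCrawlDataDumper}} code. The names {{MultipartGzipSketch}}, {{CountingStream}} and {{writeParts}} are hypothetical, and the TAR layer is left out (in the dumper, a tar stream would presumably sit on top of the gzip stream). The counter is placed between the gzip stream and the file so that it sees compressed bytes, which is why the gzip stream has to be flushed after each record before the count is checked.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

final class MultipartGzipSketch {

    /** Hypothetical counter "in the middle": counts bytes that reach the file. */
    static final class CountingStream extends FilterOutputStream {
        private long count;

        CountingStream(OutputStream out) { super(out); }

        @Override public void write(int b) throws IOException {
            out.write(b);
            count++;
        }

        @Override public void write(byte[] b, int off, int len) throws IOException {
            out.write(b, off, len);
            count += len;
        }

        long getCount() { return count; }
    }

    /**
     * Writes the records into one or more gzip files, starting a new part once
     * the compressed size of the current part reaches the threshold.
     * Every part is a complete, independent gzip file.
     */
    static List<Path> writeParts(List<byte[]> records, Path dir,
                                 String baseName, long threshold) throws IOException {
        List<Path> parts = new ArrayList<>();
        CountingStream counter = null;
        GZIPOutputStream gzip = null;
        try {
            for (byte[] record : records) {
                if (gzip == null) {
                    Path part = dir.resolve(baseName + ".part" + parts.size() + ".gz");
                    parts.add(part);
                    counter = new CountingStream(Files.newOutputStream(part));
                    // syncFlush = true, so flush() also pushes pending compressed data.
                    gzip = new GZIPOutputStream(counter, true);
                }
                gzip.write(record);
                // Without this flush the counter lags behind the data written so far.
                gzip.flush();
                if (counter.getCount() >= threshold) {
                    gzip.close();   // finishes this part, including the gzip trailer
                    gzip = null;    // the next record opens a new part
                }
            }
        } finally {
            if (gzip != null) {
                gzip.close();
            }
        }
        return parts;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("multipart-sketch");
        List<byte[]> records = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            records.add(("record " + i + "\n").getBytes(StandardCharsets.UTF_8));
        }
        // A tiny threshold, chosen only to force several parts in this demo.
        for (Path part : writeParts(records, dir, "dump", 32L)) {
            System.out.println(part.getFileName() + " " + Files.size(part) + " bytes");
        }
    }
}
```

Two trade-offs show up in this sketch. Flushing after every record costs some compression ratio, and the threshold is only checked between records, so a part can exceed it by up to one compressed record. Each part also has to be decompressed on its own, which matches the gzip limitation described above.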
I would really appreciate your feedback/ideas about this.
Thanks a lot,
Giuseppe