Giuseppe Totaro created NUTCH-1961:
--------------------------------------

             Summary: Provide multipart compression of Common Crawl data
                 Key: NUTCH-1961
                 URL: https://issues.apache.org/jira/browse/NUTCH-1961
             Project: Nutch
          Issue Type: Wish
    Affects Versions: 1.9
            Reporter: Giuseppe Totaro
            Priority: Minor


Using the {{-gzip}} option in {{CommonCrawlDataDumper}}, users can compress 
data and create a TAR archive (using [Apache Commons 
Compress|http://commons.apache.org/proper/commons-compress]). 
We could also offer the option to create a multipart compressed archive, 
starting a new part once a size threshold is reached. I ran some tests with a 
{{CountingOutputStream}} placed "in the middle" of the stream chain to count 
the bytes written, but this requires flushing the output streams at each 
iteration.
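
To make the idea concrete, here is a minimal sketch of threshold-based splitting, under several assumptions: {{MultipartGzipWriter}}, its methods and the part-naming scheme are hypothetical and not part of Nutch; the TAR layer is left out so that the example needs only the JDK; and the small {{CountingStream}} class stands in for Commons IO's {{CountingOutputStream}}. The counter sits between the gzip stream and the file, so it sees compressed bytes only after a flush, which is the cost mentioned above.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper (not part of Nutch): starts a new .gz part at a size threshold. */
final class MultipartGzipWriter implements AutoCloseable {

    /** Minimal stand-in for Commons IO's CountingOutputStream. */
    static final class CountingStream extends FilterOutputStream {
        long count;

        CountingStream(OutputStream out) {
            super(out);
        }

        @Override
        public void write(int b) throws IOException {
            out.write(b);
            count++;
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            out.write(b, off, len);
            count += len;
        }
    }

    private final Path dir;
    private final String prefix;
    private final long thresholdBytes;
    private int partIndex = 0;
    private CountingStream counter;
    private GZIPOutputStream gzip;

    MultipartGzipWriter(Path dir, String prefix, long thresholdBytes) {
        this.dir = dir;
        this.prefix = prefix;
        this.thresholdBytes = thresholdBytes;
    }

    /** Writes one record, then closes the current part if it reached the threshold. */
    void write(byte[] record) throws IOException {
        if (gzip == null) {
            openNextPart();
        }
        gzip.write(record);
        // With syncFlush enabled, flush() pushes pending compressed bytes
        // through to the counter, so the count reflects the on-disk size.
        gzip.flush();
        if (counter.count >= thresholdBytes) {
            closePart();
        }
    }

    private void openNextPart() throws IOException {
        Path part = dir.resolve(String.format("%s-%03d.gz", prefix, partIndex++));
        counter = new CountingStream(Files.newOutputStream(part));
        gzip = new GZIPOutputStream(counter, true); // true = syncFlush
    }

    private void closePart() throws IOException {
        gzip.close(); // writes the gzip trailer and closes the underlying file
        gzip = null;
    }

    int partsWritten() {
        return partIndex;
    }

    @Override
    public void close() throws IOException {
        if (gzip != null) {
            closePart();
        }
    }
}
```

Each part produced this way is a complete gzip file that can be decompressed on its own. The threshold is checked only after a whole record has been written, so a part may exceed it by up to one record. Flushing after every record also costs some compression ratio, so checking the counter less often would be a reasonable trade-off.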
Furthermore, _gzip_ does not support multipart compression (we can split the 
archive into multiple {{.tar.gz}} files, but they have to be decompressed 
individually), whereas _zip_ does (even though this feature is not yet 
supported in Apache Commons Compress).
I would really appreciate your feedback/ideas about this.
Thanks a lot,
Giuseppe



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
