Sebastian Nagel created NUTCH-2795:
--------------------------------------

             Summary: CrawlDbReader: compress CrawlDb dumps if configured
                 Key: NUTCH-2795
                 URL: https://issues.apache.org/jira/browse/NUTCH-2795
             Project: Nutch
          Issue Type: Improvement
          Components: crawldb
    Affects Versions: 1.17
            Reporter: Sebastian Nagel
             Fix For: 1.18


The dumps written by CrawlDbReader (text, CSV, JSON) are not compressed, 
even when file output compression is configured. E.g., when running
{noformat}
$> bin/nutch readdb \
       -Dmapreduce.output.fileoutputformat.compress=true \
       -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec \
       crawldb/ -dump crawldb.dump -format json
{noformat}
the output should be compressed using bzip2.

See the Hadoop class 
[FileOutputFormat|https://hadoop.apache.org/docs/r3.1.3/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html]
 and the [implementation in 
TextOutputFormat|https://github.com/apache/hadoop/blob/639acb6d8921127cde3174a302f2e3d71b44f052/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/TextOutputFormat.java].
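
As a rough illustration of the pattern TextOutputFormat follows (check the compress flag, append the codec's file extension, wrap the output stream), here is a minimal JDK-only sketch. It is not the Nutch or Hadoop code: the class and method names are hypothetical, a java.util.Properties object stands in for the Hadoop Configuration, and GZIP stands in for the configured codec because the JDK ships no bzip2 implementation. In Hadoop the corresponding calls would be FileOutputFormat.getCompressOutput(...) and FileOutputFormat.getOutputCompressorClass(...).

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, for illustration only (not part of Nutch or Hadoop).
final class CompressedDumpSketch {

  // Same property name as in the readdb command above.
  static final String COMPRESS_KEY = "mapreduce.output.fileoutputformat.compress";

  /** Opens a writer for one dump part; the stream is compressed only if the flag is set. */
  static Writer openDumpWriter(Properties conf, Path dir, String baseName) throws IOException {
    // Hadoop equivalent: FileOutputFormat.getCompressOutput(context)
    boolean compress = Boolean.parseBoolean(conf.getProperty(COMPRESS_KEY, "false"));
    // Hadoop equivalent: codec.getDefaultExtension(); ".gz" is hard-coded here (assumption)
    String extension = compress ? ".gz" : "";
    Path file = dir.resolve(baseName + extension);
    OutputStream out = Files.newOutputStream(file);
    if (compress) {
      // Hadoop equivalent: codec.createOutputStream(out)
      out = new GZIPOutputStream(out);
    }
    return new OutputStreamWriter(out, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException {
    Properties conf = new Properties();
    conf.setProperty(COMPRESS_KEY, "true");
    Path dir = Files.createTempDirectory("crawldb-dump");
    try (Writer w = openDumpWriter(conf, dir, "part-r-00000")) {
      // Made-up record, not the actual CrawlDbReader JSON schema.
      w.write("{\"url\":\"http://example.org/\"}\n");
    }
    System.out.println("wrote " + dir.resolve("part-r-00000.gz"));
  }
}
```

The point of the sketch is that the record-writing code stays unchanged; only the stream it writes to, and the file name, depend on the configuration.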



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
