[ https://issues.apache.org/jira/browse/NUTCH-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel reassigned NUTCH-2795:
--------------------------------------

    Assignee: Sebastian Nagel

> CrawlDbReader: compress CrawlDb dumps if configured
> ---------------------------------------------------
>
>                 Key: NUTCH-2795
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2795
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb
>    Affects Versions: 1.17
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Minor
>              Labels: help-wanted
>             Fix For: 1.19
>
>
> The dumps of CrawlDbReader (text, CSV, JSON) are not compressed even when
> file output compression is configured. E.g., if running
> {noformat}
> $> bin/nutch readdb \
>        -Dmapreduce.output.fileoutputformat.compress=true \
>        -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec \
>        crawldb/ -dump crawldb.dump -format json
> {noformat}
> the output should be compressed using bzip2.
> See the Hadoop class [FileOutputFormat|https://hadoop.apache.org/docs/r3.1.3/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html]
> and the [implementation in TextOutputFormat|https://github.com/apache/hadoop/blob/639acb6d8921127cde3174a302f2e3d71b44f052/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/TextOutputFormat.java].
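
The requested behaviour amounts to: check the compression setting, and if it is enabled, wrap the dump's output stream in a compressing stream and add the matching file extension. Below is a minimal sketch of that pattern using only the JDK. It is not the Nutch or Hadoop implementation. The class DumpOutputSketch and its methods are hypothetical, java.util.Properties stands in for the Hadoop configuration, and gzip stands in for the configured codec because the JDK ships no bzip2 stream. Only the property name is taken from the command above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration, not Nutch or Hadoop code.
class DumpOutputSketch {
  // Property name as used in the readdb command of the issue description.
  static final String COMPRESS = "mapreduce.output.fileoutputformat.compress";

  static boolean isCompressed(Properties conf) {
    return Boolean.parseBoolean(conf.getProperty(COMPRESS, "false"));
  }

  // Append the codec's extension only when compression is enabled
  // (".gz" here because gzip is the stand-in codec).
  static String outputName(Properties conf, String baseName) {
    return isCompressed(conf) ? baseName + ".gz" : baseName;
  }

  // Wrap the raw stream in a compressing stream only when enabled.
  static OutputStream wrap(Properties conf, OutputStream raw) throws IOException {
    return isCompressed(conf) ? new GZIPOutputStream(raw) : raw;
  }

  // Write one dump record and return the bytes that reached the sink.
  static byte[] dump(Properties conf, String record) {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    try (OutputStream out = wrap(conf, sink)) {
      out.write(record.getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return sink.toByteArray();
  }

  // Helper to verify a compressed dump by decompressing it again.
  static String gunzip(byte[] data) {
    try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
      return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

With the property unset the record is written as plain text under the unchanged file name. With it set to true the same record is written gzip-compressed under a name ending in ".gz".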



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
