[ https://issues.apache.org/jira/browse/NUTCH-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17582526#comment-17582526 ]

ASF GitHub Bot commented on NUTCH-2795:
---------------------------------------

sebastian-nagel merged PR #746:
URL: https://github.com/apache/nutch/pull/746


> CrawlDbReader: compress CrawlDb dumps if configured
> ---------------------------------------------------
>
>                 Key: NUTCH-2795
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2795
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb
>    Affects Versions: 1.17
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Minor
>              Labels: help-wanted
>             Fix For: 1.19
>
>
> The dumps of CrawlDbReader (text, CSV, JSON) are not compressed given the
> configured file output compression. E.g., if running
> {noformat}
> $> bin/nutch readdb \
>      -Dmapreduce.output.fileoutputformat.compress=true \
>      -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec \
>      crawldb/ -dump crawldb.dump -format json
> {noformat}
> the output should be compressed using bzip2.
> See the Hadoop class
> [FileOutputFormat|https://hadoop.apache.org/docs/r3.1.3/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html]
> and the
> [implementation in TextOutputFormat|https://github.com/apache/hadoop/blob/639acb6d8921127cde3174a302f2e3d71b44f052/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/TextOutputFormat.java].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)