Sebastian Nagel created NUTCH-2795:
--------------------------------------
Summary: CrawlDbReader: compress CrawlDb dumps if configured
Key: NUTCH-2795
URL: https://issues.apache.org/jira/browse/NUTCH-2795
Project: Nutch
Issue Type: Improvement
Components: crawldb
Affects Versions: 1.17
Reporter: Sebastian Nagel
Fix For: 1.18
The dumps written by CrawlDbReader (text, CSV, JSON) are not compressed,
even when file output compression is configured. E.g., when running
{noformat}
$> bin/nutch readdb \
-Dmapreduce.output.fileoutputformat.compress=true \
-Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec \
crawldb/ -dump crawldb.dump -format json
{noformat}
the output should be compressed using bzip2.
See the Hadoop class
[FileOutputFormat|https://hadoop.apache.org/docs/r3.1.3/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html]
and the [implementation in TextOutputFormat|https://github.com/apache/hadoop/blob/639acb6d8921127cde3174a302f2e3d71b44f052/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/TextOutputFormat.java].
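The pattern a dump writer would need can be sketched roughly as follows. This is only an illustration using the Java standard library, not the Nutch or Hadoop code: {{java.util.Properties}} stands in for the Hadoop configuration, gzip from {{java.util.zip}} stands in for the configured codec, and the class and method names ({{CompressedDumpSketch}}, {{open}}, {{dump}}, {{extension}}) are hypothetical. Only the property name is taken from the command above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch: wrap the dump stream in a compressor only if configured. */
public class CompressedDumpSketch {

  // Property name as used in the readdb command above.
  static final String COMPRESS = "mapreduce.output.fileoutputformat.compress";

  /** True if output compression is switched on (assumed default: off). */
  static boolean isCompressed(Properties conf) {
    return Boolean.parseBoolean(conf.getProperty(COMPRESS, "false"));
  }

  /** File name extension matching the chosen stream (gzip is a stand-in codec). */
  static String extension(Properties conf) {
    return isCompressed(conf) ? ".gz" : "";
  }

  /** Returns the raw stream, or the raw stream wrapped in a compressor. */
  static OutputStream open(Properties conf, OutputStream raw) throws IOException {
    return isCompressed(conf) ? new GZIPOutputStream(raw) : raw;
  }

  /** Writes one record through the (possibly compressing) stream, returns the bytes. */
  static byte[] dump(Properties conf, String record) {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    try (OutputStream out = open(conf, sink)) {
      out.write(record.getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return sink.toByteArray();
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty(COMPRESS, "true");
    byte[] bytes = dump(conf, "{\"url\":\"http://example.org/\"}\n");
    System.out.println("part-r-00000" + extension(conf) + ": " + bytes.length + " bytes");
  }
}
```

In Hadoop itself the corresponding decisions would presumably be made via {{FileOutputFormat.getCompressOutput(job)}} and {{FileOutputFormat.getOutputCompressorClass(job, defaultCodec)}}, with the codec creating the wrapped output stream and supplying the file extension.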
--
This message was sent by Atlassian Jira
(v8.3.4#803005)