[ 
https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suraj Nayak updated HADOOP-13114:
---------------------------------
    Description: 
The DistCp utility should be able to store data in a user-specified 
compression format. This avoids a separate compression step after the 
transfer. Backup strategies that copy to a different cluster also benefit by 
saving one I/O operation to and from HDFS, which saves resources, time and effort.

* Create an option, {{-compressOutput}}, whose codec defaults to 
{{org.apache.hadoop.io.compress.BZip2Codec}}. 
* Users will be able to change the codec with {{-D 
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec}}.
* If DistCp compression is enabled, suffix the filenames with the codec's 
default extension to indicate that the file is compressed, so users can tell 
which codec was used to compress the data.
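
The filename-suffix rule in the last bullet can be illustrated with a small Python sketch. This is only an illustration, not DistCp code: the function name {{compressed_target_name}} and the lookup table are hypothetical. The extensions in the table are the ones Hadoop's built-in codecs report as their default extension; a real implementation would presumably ask the codec object for its extension rather than keep a table.

```python
# Hypothetical sketch of the proposed suffix rule; not part of DistCp.
# The table lists default extensions of some built-in Hadoop codecs.
DEFAULT_EXTENSIONS = {
    "org.apache.hadoop.io.compress.BZip2Codec": ".bz2",
    "org.apache.hadoop.io.compress.GzipCodec": ".gz",
    "org.apache.hadoop.io.compress.DefaultCodec": ".deflate",
}

# The proposal makes BZip2Codec the default when -compressOutput is given.
DEFAULT_CODEC = "org.apache.hadoop.io.compress.BZip2Codec"


def compressed_target_name(target_path, codec=DEFAULT_CODEC):
    """Return target_path with the codec's default extension appended."""
    try:
        extension = DEFAULT_EXTENSIONS[codec]
    except KeyError:
        raise ValueError("unknown codec: " + codec)
    return target_path + extension


if __name__ == "__main__":
    print(compressed_target_name("/backup/logs/part-00000"))
    print(compressed_target_name(
        "/backup/logs/part-00000",
        "org.apache.hadoop.io.compress.GzipCodec"))
```

Under this sketch, a file copied with the default codec gains a {{.bz2}} suffix, and one copied with {{GzipCodec}} gains {{.gz}}.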

This JIRA is similar to 
[HADOOP-8065|https://issues.apache.org/jira/browse/HADOOP-8065]. 
[HADOOP-8065|https://issues.apache.org/jira/browse/HADOOP-8065] aims to 
compress data *during transit*, which is a huge effort. This JIRA is simplified 
to let the user compress data when it lands on the target filesystem.

  was:
DistCp utility should have capability to store data in user specified 
compressed format. This avoids one hop of compressing data after transfer. 
Backup strategies to different cluster gets benefit saving one IO operation, 
time and effort.

* Create a option -compressOutput with defaulting to 
{{org.apache.avro.file.BZip2Codec}}. 
* Users will be able to change codec with {{-D 
mapreduce.output.fileoutputformat.compress.codec=org.apache.avro.file.SnappyCodec}}
* If distcp compression is enables, suffix the filenames with default codec 
extension to indicate the file is compressed. Thus users can be aware of what 
codec was used to compress the data.

This JIRA is similar to 
[HADOOP-8065|https://issues.apache.org/jira/browse/HADOOP-8065]. 
[HADOOP-8065|https://issues.apache.org/jira/browse/HADOOP-8065] aims to 
compress data *during transit* which is a huge effort. This JIRA is simplified 
to enable to user to compress data when the data lands on target filesystem.


> DistCp should have option to compress data on write
> ---------------------------------------------------
>
>                 Key: HADOOP-13114
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13114
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Suraj Nayak
>            Assignee: Suraj Nayak
>            Priority: Minor
>             Fix For: 3.0.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The DistCp utility should be able to store data in a user-specified 
> compression format. This avoids a separate compression step after the 
> transfer. Backup strategies that copy to a different cluster also benefit by 
> saving one I/O operation to and from HDFS, which saves resources, time and effort.
> * Create an option, {{-compressOutput}}, whose codec defaults to 
> {{org.apache.hadoop.io.compress.BZip2Codec}}. 
> * Users will be able to change the codec with {{-D 
> mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec}}.
> * If DistCp compression is enabled, suffix the filenames with the codec's 
> default extension to indicate that the file is compressed, so users can tell 
> which codec was used to compress the data.
> This JIRA is similar to 
> [HADOOP-8065|https://issues.apache.org/jira/browse/HADOOP-8065]. 
> [HADOOP-8065|https://issues.apache.org/jira/browse/HADOOP-8065] aims to 
> compress data *during transit*, which is a huge effort. This JIRA is 
> simplified to let the user compress data when it lands on the target 
> filesystem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)