[ https://issues.apache.org/jira/browse/HADOOP-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15269857#comment-15269857 ]

Ravi Prakash commented on HADOOP-8065:
--------------------------------------

Thanks for the patch [~snayakm]! Here are some of my thoughts:

# What users seem to want is to be able to compress data *during transit*. 
{color:red}*This patch does not enable compression of data during 
transit.*{color} Distcp is simply an MR job whose maps read from a "source". 
If the source does not support compressing the data before putting it on the 
network, I don't see how we could achieve what these users want.
# *We are simply enabling users to avoid a post-processing step to compress the 
data they have already transferred*. This too is a noble goal IMHO, since it 
makes users' lives easier. It also reduces the amount of space needed on the 
target filesystem. If that is the stated goal, we should rewrite the JIRA 
summary to be more explicit about it.

Reviewing the patch:
# Do you really need the changes in {{CopyMapper}}?
# Nit: {{getCompressionCodcec}} is misspelt
# Instead of {code}      e.printStackTrace();
      LOG.error("Compression class " + compressionCodecClass
          + " not found in classpath");{code} you can simply pass {{e}} as the 
second argument to {{LOG.error}}; the logger will then include the stack trace 
for you (see the first sketch after this list).
# With this patch, we'll end up creating an instance of a Codec for every file. 
Do you think we could utilize something like 
{{org.apache.hadoop.io.compress.CodecPool}}? (See the second sketch after this 
list.)
# Perhaps we can add an option {{-compressOutput}} which defaults to some codec?
# Although it's conceivable that we may want to decompress before writing to the 
target filesystem, we can punt that to another JIRA.
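
To illustrate point 3, here's a minimal sketch of what the catch block could 
look like; the class name {{CodecLookup}}, the {{getCodec}} method, and the 
{{compressionCodecClass}} / {{conf}} variables are my own illustrative names, 
not identifiers from the patch:

{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.util.ReflectionUtils;

// Sketch only: names are illustrative, not taken from the patch.
public class CodecLookup {
  private static final Log LOG = LogFactory.getLog(CodecLookup.class);

  static CompressionCodec getCodec(String compressionCodecClass, Configuration conf) {
    try {
      Class<? extends CompressionCodec> codecClass =
          conf.getClassByName(compressionCodecClass)
              .asSubclass(CompressionCodec.class);
      return ReflectionUtils.newInstance(codecClass, conf);
    } catch (ClassNotFoundException e) {
      // Passing e as the second argument replaces e.printStackTrace();
      // the logger records the stack trace itself.
      LOG.error("Compression class " + compressionCodecClass
          + " not found in classpath", e);
      return null;
    }
  }
}
{code}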
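
And for point 4, a rough sketch of how {{CodecPool}} could be used to reuse 
compressors across files; {{CompressedWrite}} and {{copyCompressed}} are 
illustrative names, not anything in the patch:

{code}
import java.io.IOException;
import java.io.OutputStream;

import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;

// Sketch only: names are illustrative, not taken from the patch.
public class CompressedWrite {
  static void copyCompressed(CompressionCodec codec, OutputStream rawOut,
      byte[] buffer, int length) throws IOException {
    // Borrow a Compressor from the pool instead of letting the codec
    // allocate a fresh one for every file.
    Compressor compressor = CodecPool.getCompressor(codec);
    try {
      CompressionOutputStream out = codec.createOutputStream(rawOut, compressor);
      out.write(buffer, 0, length);
      out.finish();  // flush the compressed trailer; the caller owns rawOut
    } finally {
      // Always hand the Compressor back so the next file can reuse it.
      CodecPool.returnCompressor(compressor);
    }
  }
}
{code}

The pool caches {{Compressor}} instances, which matters most for native codecs 
where constructing one is relatively expensive.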

Thanks for your efforts! :-)

> distcp should have an option to compress data while copying.
> ------------------------------------------------------------
>
>                 Key: HADOOP-8065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8065
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>    Affects Versions: 0.20.2
>            Reporter: Suresh Antony
>            Assignee: Suraj Nayak
>            Priority: Minor
>              Labels: distcp
>             Fix For: 0.20.2
>
>         Attachments: HADOOP-8065-trunk_2015-11-03.patch, 
> HADOOP-8065-trunk_2015-11-04.patch, HADOOP-8065-trunk_2016-04-29-4.patch, 
> patch.distcp.2012-02-10
>
>
> We would like to compress the data while transferring it from our source 
> system to the target system. One way to do this is to write a map/reduce job 
> that compresses the data before or after it is transferred, but that looks 
> inefficient. Since distcp is already reading and writing the data, it would 
> be better if it could compress as part of the copy. 
> The flip side is that the distcp -update option cannot check file size 
> before copying data; it can only check for the existence of the file. 
> So I propose that if the -compress option is given, the file size is not checked.
> Also, when we copy a file, an appropriate extension needs to be added to it 
> depending on the compression type.


