[
https://issues.apache.org/jira/browse/HADOOP-13578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15550599#comment-15550599
]
churro morales edited comment on HADOOP-13578 at 10/6/16 3:18 AM:
------------------------------------------------------------------
[~jlowe] thank you for the thorough review. The reason the zstd CLI and Hadoop
can't read each other's compressed / decompressed data is that ZStandardCodec
uses the Block(Compressor|Decompressor) streams. I assumed this library would
mostly be used to compress large amounts of data, and with those streams each
block gets a header followed by the compressed data. I believe the 8 bytes you
are referring to are two ints (the sizes of the compressed and uncompressed
block). If you strip those headers, the CLI can read the zstd blocks;
conversely, if you compress a file with the zstd CLI and prepend the size
headers, it will work in Hadoop.
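To make the framing concrete, here is a rough sketch in plain C against zstd's
one-shot API. It is not the code in the patch, and the header layout assumed
here (uncompressed size first, then compressed size, each a 4-byte big-endian
int, i.e. what Java's DataOutputStream.writeInt would produce) is just my
reading of the stream format described above. It unwraps a single block and
hands the payload to ZSTD_decompress, which is essentially what the CLI sees
once the headers are removed:
{code}
/*
 * Minimal sketch, not the actual Hadoop code: strip the assumed 8-byte
 * per-block header (uncompressed size then compressed size, big-endian
 * 4-byte ints) and decompress the raw zstd payload that follows.
 * Assumes a single block in the input file.
 */
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

static unsigned read_be32(FILE *f) {
    unsigned char b[4];
    if (fread(b, 1, 4, f) != 4) { perror("read header"); exit(1); }
    return ((unsigned)b[0] << 24) | ((unsigned)b[1] << 16) |
           ((unsigned)b[2] << 8)  |  (unsigned)b[3];
}

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    FILE *in = fopen(argv[1], "rb");
    if (!in) { perror("fopen"); return 1; }

    unsigned uncompressedLen = read_be32(in);  /* assumed first int  */
    unsigned compressedLen   = read_be32(in);  /* assumed second int */

    void *src = malloc(compressedLen);
    void *dst = malloc(uncompressedLen);
    if (!src || !dst) { fprintf(stderr, "out of memory\n"); return 1; }
    if (fread(src, 1, compressedLen, in) != compressedLen) {
        perror("read payload"); return 1;
    }

    /* Once the header is gone this is a plain zstd frame, which is why the
     * CLI can read the data after the headers are removed. */
    size_t ret = ZSTD_decompress(dst, uncompressedLen, src, compressedLen);
    if (ZSTD_isError(ret)) {
        fprintf(stderr, "zstd error: %s\n", ZSTD_getErrorName(ret));
        return 1;
    }
    fwrite(dst, 1, ret, stdout);
    free(src); free(dst); fclose(in);
    return 0;
}
{code}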
The Snappy compressor / decompressor works the same way: I do not believe you
can compress in Snappy format using Hadoop, transfer the file locally, and
then call Snappy.uncompress() without removing the headers.
If we do not want this to be compressed at the block level, that is fine.
Otherwise we can add a utility to Hadoop to take care of the block headers, as
was done for hadoop-snappy and by some of the Snappy CLI tools such as snzip.
As far as the decompressed bytes go, I agree. I will check that the size
returned by the function that reports how many bytes are needed to decompress
the buffer is not larger than our buffer size, and I can also add the isError
and getErrorName calls to the decompression library. The reason I explicitly
checked whether the expected size equaled the desired size is that the error
zstd provided was too vague, but I'll add it in case there are other errors.
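If the function in question is zstd's ZSTD_getDecompressedSize (the v1.x API
that reports how many bytes a frame needs; it returns 0 when the size is
unknown), the check could look roughly like this on the native side. This is
illustrative only, not the JNI code in the patch, and the names are mine:
{code}
#include <stdio.h>
#include <zstd.h>

/* Illustrative sketch: validate the reported decompressed size against the
 * destination buffer, then surface zstd's own error text via ZSTD_isError /
 * ZSTD_getErrorName instead of only comparing expected vs. actual sizes. */
int decompress_checked(void *dst, size_t dstCapacity,
                       const void *src, size_t srcSize) {
    unsigned long long needed = ZSTD_getDecompressedSize(src, srcSize);
    if (needed == 0) {
        fprintf(stderr, "decompressed size unknown or empty frame\n");
        return -1;
    }
    if (needed > dstCapacity) {
        fprintf(stderr, "destination buffer too small: need %llu, have %zu\n",
                needed, dstCapacity);
        return -1;
    }

    size_t ret = ZSTD_decompress(dst, dstCapacity, src, srcSize);
    if (ZSTD_isError(ret)) {
        fprintf(stderr, "zstd error: %s\n", ZSTD_getErrorName(ret));
        return -1;
    }
    return (int)ret;  /* bytes actually written into dst */
}
{code}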
Yes, I will look at HADOOP-13684. The build of the codec was very similar to
Snappy's because the license is BSD, so we could package it in the same way as
Snappy. I will also take care of the nits you described.
Are we okay with the compression being at the block level? If so, this
implementation will work just like all of the other block compression codecs,
adding / requiring the header for the Hadoop blocks.
Thanks again for the review.
> Add Codec for ZStandard Compression
> -----------------------------------
>
> Key: HADOOP-13578
> URL: https://issues.apache.org/jira/browse/HADOOP-13578
> Project: Hadoop Common
> Issue Type: New Feature
> Reporter: churro morales
> Assignee: churro morales
> Attachments: HADOOP-13578.patch, HADOOP-13578.v1.patch
>
>
> ZStandard (https://github.com/facebook/zstd) has been used in production at
> Facebook for 6 months now. v1.0 was recently released. Create a codec for
> this library.