[
https://issues.apache.org/jira/browse/HADOOP-13578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15550599#comment-15550599
]
churro morales edited comment on HADOOP-13578 at 10/6/16 3:18 AM:
------------------------------------------------------------------
[~jlowe] thank you for the thorough review. The reason the zstd CLI and Hadoop
can't read each other's compressed / decompressed data is that ZStandardCodec
uses the Block(Compressor|Decompressor) streams. I assumed this library would
mostly be used to compress large amounts of data, and with those streams each
block gets a header followed by the compressed data. I believe the 8 bytes you
are referring to are two ints (the sizes of the compressed and uncompressed
block). If you strip those headers, the CLI can read the zstd blocks;
conversely, if you compress a file with the zstd CLI and prepend the size
headers, it will work in Hadoop.
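To make the framing concrete, here is a rough sketch in plain C against zstd's
one-shot API. It is not the code in the patch, and the header layout assumed
here (uncompressed size first, then compressed size, each a 4-byte big-endian
int, i.e. what Java's DataOutputStream.writeInt would produce) is just my
reading of the stream format described above. It unwraps a single block and
hands the payload to ZSTD_decompress, which is essentially what the CLI sees
once the headers are removed:
{code}
/*
 * Minimal sketch, not the actual Hadoop code: strip the assumed 8-byte
 * per-block header (uncompressed size then compressed size, big-endian
 * 4-byte ints) and decompress the raw zstd payload that follows.
 * Assumes a single block in the input file.
 */
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

static unsigned read_be32(FILE *f) {
    unsigned char b[4];
    if (fread(b, 1, 4, f) != 4) { perror("read header"); exit(1); }
    return ((unsigned)b[0] << 24) | ((unsigned)b[1] << 16) |
           ((unsigned)b[2] << 8)  |  (unsigned)b[3];
}

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    FILE *in = fopen(argv[1], "rb");
    if (!in) { perror("fopen"); return 1; }

    unsigned uncompressedLen = read_be32(in);  /* assumed first int  */
    unsigned compressedLen   = read_be32(in);  /* assumed second int */

    void *src = malloc(compressedLen);
    void *dst = malloc(uncompressedLen);
    if (!src || !dst) { fprintf(stderr, "out of memory\n"); return 1; }
    if (fread(src, 1, compressedLen, in) != compressedLen) {
        perror("read payload"); return 1;
    }

    /* Once the header is gone this is a plain zstd frame, which is why the
     * CLI can read the data after the headers are removed. */
    size_t ret = ZSTD_decompress(dst, uncompressedLen, src, compressedLen);
    if (ZSTD_isError(ret)) {
        fprintf(stderr, "zstd error: %s\n", ZSTD_getErrorName(ret));
        return 1;
    }
    fwrite(dst, 1, ret, stdout);
    free(src); free(dst); fclose(in);
    return 0;
}
{code}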
The Snappy compressor / decompressor works the same way: I do not believe you
can compress in Snappy format using Hadoop, transfer the file locally, and
then call Snappy.uncompress() without removing the headers.
If we do not want this to be compressed at the block level, that is fine.
Otherwise we can add a utility to Hadoop to take care of the block headers, as
was done for hadoop-snappy and by some of the Snappy CLI tools such as snzip.
As far as the decompressed bytes go, I agree. I will check that the size
returned by the function that reports how many bytes are needed to decompress
the buffer is not larger than our buffer size, and I can also add the isError
and getErrorName calls to the decompression library. The reason I explicitly
checked whether the expected size equaled the desired size is that the error
zstd provided was too vague, but I'll add it in case there are other errors.
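If the function in question is zstd's ZSTD_getDecompressedSize (the v1.x API
that reports how many bytes a frame needs; it returns 0 when the size is
unknown), the check could look roughly like this on the native side. This is
illustrative only, not the JNI code in the patch, and the names are mine:
{code}
#include <stdio.h>
#include <zstd.h>

/* Illustrative sketch: validate the reported decompressed size against the
 * destination buffer, then surface zstd's own error text via ZSTD_isError /
 * ZSTD_getErrorName instead of only comparing expected vs. actual sizes. */
int decompress_checked(void *dst, size_t dstCapacity,
                       const void *src, size_t srcSize) {
    unsigned long long needed = ZSTD_getDecompressedSize(src, srcSize);
    if (needed == 0) {
        fprintf(stderr, "decompressed size unknown or empty frame\n");
        return -1;
    }
    if (needed > dstCapacity) {
        fprintf(stderr, "destination buffer too small: need %llu, have %zu\n",
                needed, dstCapacity);
        return -1;
    }

    size_t ret = ZSTD_decompress(dst, dstCapacity, src, srcSize);
    if (ZSTD_isError(ret)) {
        fprintf(stderr, "zstd error: %s\n", ZSTD_getErrorName(ret));
        return -1;
    }
    return (int)ret;  /* bytes actually written into dst */
}
{code}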
Yes, I will look at HADOOP-13684. The build of the codec was very similar to
Snappy's because the license is BSD, so we could package it in the same way as
Snappy. I will also take care of the nits you described.
Are we okay with the compression being at the block level? If so, this
implementation will work just like all of the other block compression codecs,
adding / requiring the header for the Hadoop blocks.
Thanks again for the review.
> Add Codec for ZStandard Compression
> -----------------------------------
>
> Key: HADOOP-13578
> URL: https://issues.apache.org/jira/browse/HADOOP-13578
> Project: Hadoop Common
> Issue Type: New Feature
> Reporter: churro morales
> Assignee: churro morales
> Attachments: HADOOP-13578.patch, HADOOP-13578.v1.patch
>
>
> ZStandard (https://github.com/facebook/zstd) has been used in production at
> Facebook for 6 months now. v1.0 was recently released. Create a codec for
> this library.