[ 
https://issues.apache.org/jira/browse/HBASE-26259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17418374#comment-17418374
 ] 

Andrew Kyle Purtell edited comment on HBASE-26259 at 9/22/21, 2:12 AM:
-----------------------------------------------------------------------

[~zhangduo] Thanks for taking a look. 

I will be uploading new PDFs soon, once the latest round of microbenchmarks is 
finished. The results are better: LZ4 is even faster (50% faster than Hadoop 
native) and Snappy is not as slow (20% slower than Hadoop native at worst).

Block size is the size of the data buffer compressed or decompressed each 
round, like an HFile block. 

Sigma is the sigma parameter of the Zipfian distribution used to generate test 
data. At 1.1 the data is basically incompressible. From there, as sigma 
increases, the compressibility of the data increases: a sigma of 2 produces 
data that compresses reasonably well (30-40%), and a sigma of 5 produces data 
that compresses very well. 
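
For illustration only (the attached RandomDistribution.java is the actual 
generator), here is a minimal sketch of Zipfian byte generation, assuming 
symbol k is drawn with probability proportional to 1/k^sigma:

{code:java}
import java.util.Random;

// Sketch only: generates 'length' bytes whose values follow a Zipfian
// distribution over a 256-symbol alphabet; larger sigma concentrates the
// probability mass on a few symbols, making the data more compressible.
public class ZipfianBytesSketch {
  public static byte[] generate(int length, double sigma, long seed) {
    int alphabet = 256;
    double[] cdf = new double[alphabet];
    double total = 0.0;
    for (int k = 1; k <= alphabet; k++) {
      total += 1.0 / Math.pow(k, sigma);
      cdf[k - 1] = total;
    }
    Random rng = new Random(seed);
    byte[] data = new byte[length];
    for (int i = 0; i < length; i++) {
      double u = rng.nextDouble() * total;
      int k = 0;
      while (cdf[k] < u) {   // linear scan is fine for a 256-symbol alphabet
        k++;
      }
      data[i] = (byte) k;
    }
    return data;
  }
}
{code}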

Number of blocks is how many times the compressor is called in a loop, or how 
many blocks are written to or read from a compression stream (again, it's like 
how HFile blocks would be handled).

Time is average ms per operation as measured by JMH. 

Error is the measurement error (confidence interval) reported by JMH. 
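
For context, a minimal JMH skeleton showing how an average-time-per-operation 
figure and its error bound would be produced; the real harness is the attached 
BenchmarksMain.java, and the parameter values below are illustrative:

{code:java}
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)        // "Time": average time per operation
@OutputTimeUnit(TimeUnit.MILLISECONDS)  // reported in ms; JMH also reports the error bound
public class CodecBenchmarkSketch {
  @Param({"1024", "65536", "1048576"})  // 'block size' (illustrative values)
  int blockSize;

  @Param({"1.1", "2", "5"})             // 'sigma' for the Zipfian test data
  double sigma;

  @Benchmark
  public void compressOneMegabyte() {
    // one "operation": compress 1 MB of generated data in 'block size' chunks
  }
}
{code}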

Difference is how much better or worse the codec provided in the patch performs 
compared to the corresponding Hadoop native codec, as a percentage. 

"Operation" for the three microbenchmarks is defined as:

First case: Get a compressor, call setInput(), call finish(), call compress(). 
This is one round of how Hadoop compression streams drive the compressor. This 
is done over buffers of 'block size', 'number of blocks' times. Input data size 
totals 1 MB. 
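
A minimal sketch of one first-case operation against the Hadoop Compressor API; 
the codec choice, block size, and output buffer sizing are illustrative, not 
the attached benchmark code:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.io.compress.Lz4Codec;

public class RawCompressorSketch {
  public static void main(String[] args) throws Exception {
    Lz4Codec codec = new Lz4Codec();
    codec.setConf(new Configuration());
    int blockSize = 64 * 1024;                  // 'block size' (illustrative)
    int numBlocks = (1024 * 1024) / blockSize;  // 'number of blocks': 1 MB of input in total
    byte[] input = new byte[blockSize];         // Zipfian test data in the real benchmark
    byte[] output = new byte[blockSize * 2];    // headroom for incompressible input
    Compressor compressor = codec.createCompressor();
    for (int i = 0; i < numBlocks; i++) {       // one setInput/finish/compress round per block
      compressor.reset();
      compressor.setInput(input, 0, input.length);
      compressor.finish();
      while (!compressor.finished()) {
        compressor.compress(output, 0, output.length);
      }
    }
  }
}
{code}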

Second case: Create a compression stream, write 'number of blocks' of data, 
each write of 'block size' size. Close the stream. Input data size totals 1 MB. 
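
A minimal sketch of one second-case operation (again illustrative, not the 
attached benchmark code):

{code:java}
import java.io.ByteArrayOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Lz4Codec;

public class CompressionStreamWriteSketch {
  public static void main(String[] args) throws Exception {
    Lz4Codec codec = new Lz4Codec();
    codec.setConf(new Configuration());
    int blockSize = 64 * 1024;                  // 'block size' (illustrative)
    int numBlocks = (1024 * 1024) / blockSize;  // 1 MB of input in total
    byte[] block = new byte[blockSize];         // Zipfian test data in the real benchmark
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    CompressionOutputStream out = codec.createOutputStream(sink);
    for (int i = 0; i < numBlocks; i++) {
      out.write(block, 0, block.length);        // one write per block
    }
    out.close();                                // closing the stream completes the operation
  }
}
{code}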

Third case: Create a decompression stream. Read 'number of blocks' of data, 
each read of 'block size' size uncompressed. Close the stream. Uncompressed 
data size totals 1 MB. 
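
A minimal sketch of one third-case operation; the compressed input would be 
prepared outside the measured operation:

{code:java}
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Lz4Codec;

public class DecompressionStreamReadSketch {
  public static void main(String[] args) throws Exception {
    Lz4Codec codec = new Lz4Codec();
    codec.setConf(new Configuration());
    int blockSize = 64 * 1024;                  // 'block size' (illustrative)
    int numBlocks = (1024 * 1024) / blockSize;  // 1 MB uncompressed in total

    // Setup, not part of the measured operation: produce some compressed input.
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    CompressionOutputStream out = codec.createOutputStream(sink);
    out.write(new byte[blockSize * numBlocks]);
    out.close();

    // The measured operation: stream-read the data back in 'block size' chunks.
    CompressionInputStream in =
        codec.createInputStream(new ByteArrayInputStream(sink.toByteArray()));
    byte[] buf = new byte[blockSize];
    for (int i = 0; i < numBlocks; i++) {
      int read = 0;
      while (read < blockSize) {                // a single read() may return fewer bytes
        int n = in.read(buf, read, blockSize - read);
        if (n < 0) break;
        read += n;
      }
    }
    in.close();
  }
}
{code}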


> Fallback support to pure Java compression
> -----------------------------------------
>
>                 Key: HBASE-26259
>                 URL: https://issues.apache.org/jira/browse/HBASE-26259
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Andrew Kyle Purtell
>            Assignee: Andrew Kyle Purtell
>            Priority: Major
>             Fix For: 2.5.0, 3.0.0-alpha-2
>
>         Attachments: BenchmarkCodec.java, BenchmarksMain.java, 
> RandomDistribution.java, ac_lz4_results.pdf, ac_snappy_results.pdf, 
> ac_zstd_results.pdf, lz4_lz4-java_result.pdf, xerial_snappy_results.pdf
>
>
> Airlift’s aircompressor 
> (https://search.maven.org/artifact/io.airlift/aircompressor) is an Apache 2 
> licensed library, for Java 8 and up, available in Maven central, which 
> provides pure Java implementations of gzip, lz4, lzo, snappy, and zstd and 
> Hadoop compression codecs for same, claiming “_they are typically 300% faster 
> than the JNI wrappers_.” (https://github.com/airlift/aircompressor). This 
> library is under active development, with up-to-date releases, because it is 
> used by Trino.
> Proposed changes:
> * Modify Compression.java such that compression codec implementation classes 
> can be specified by configuration. Currently they are hardcoded as strings.
> * Pull in aircompressor as a ‘compile’ time dependency so it will be bundled 
> into our build and made available on the server classpath.
> * Modify Compression.java to fall back to an aircompressor pure Java 
> implementation if the schema specifies a compression algorithm, a Hadoop 
> native codec was specified as the desired implementation, but the requisite 
> native support is somehow not available (a minimal sketch of this fallback 
> follows below).
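
For illustration, a minimal sketch of the proposed fallback selection; the 
configuration key and the pure Java codec class name are assumptions for the 
sketch, not the actual Compression.java change:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.util.NativeCodeLoader;
import org.apache.hadoop.util.ReflectionUtils;

public final class CodecFallbackSketch {
  public static CompressionCodec lz4Codec(Configuration conf) throws ClassNotFoundException {
    // Prefer the configured codec (defaulting to the Hadoop native one) when
    // native support is loaded; otherwise fall back to a pure Java implementation.
    String impl = NativeCodeLoader.isNativeCodeLoaded()
        ? conf.get("hbase.io.compress.lz4.codec",           // hypothetical configuration key
            "org.apache.hadoop.io.compress.Lz4Codec")
        : "io.airlift.compress.lz4.Lz4Codec";               // assumed aircompressor codec class
    return (CompressionCodec) ReflectionUtils.newInstance(conf.getClassByName(impl), conf);
  }
}
{code}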


