Hello Zoltan,

> Is data pre-loaded to RAM before making the measurements?

Yes, the file is read into physical memory. For mmap-ed files read from
external storage I would expect, though I am not 100% sure, that the IO
overhead would be big enough that all algorithms would compress at roughly
the same speed.

> In "Figure 3: Decompression speed in MB/s", is data size measured before or
> after decompression?
> In "Figure 4: Compression speed in MB/s", is data size measured before or
> after compression?

For both, the reported result is "size of the original file / time to
compress or decompress".

> According to "Figure 3: Decompression speed in MB/s", decompression of
> bs_zstd is almost twice as fast as plain zstd. Do you know what causes this
> massive speed improvement?

I do not know all of the details. As you mentioned, the written-out data is
smaller, which could lead to an improvement in speed, since less data has to
be written to memory during compression or read from memory during
decompression.

Another thing to consider is that zstd uses different techniques to compress
a block of data - "raw", "RLE", "Huffman coding", "Treeless coding". I expect
that "Huffman coding" is more costly than "RLE", and I also expect "RLE" to
be applicable to the majority of the sign bits, thus leading to a performance
win when the transformation is applied.

I also expect that zstd has to do some form of "optimal parsing" to decide
how to process the input in order to compress it well. This is something
every wanna-be-good LZ-like compressor has to do
(https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html,
http://cbloomrants.blogspot.com/2011/10/10-24-11-lz-optimal-parse-with-star.html).
It might be that the transformed input is somehow easier, which leads to
faster compression, and that the resulting data is also easier to decompress,
which leads to faster decompression.

I used this as a reference: https://www.rfc-editor.org/rfc/pdfrfc/rfc8478.txt.pdf.
I am not familiar with zstd internals in particular.

I also checked that the majority of the time is spent in zstd.

Example run for msg_sweep3d.dp using zstd at level 1:
- Transformation during compression: 0.086s, zstd compress on transformed data: 0.08s
- regular zstd compress: 0.34s
- zstd decompress from compressed transformed data: 0.067s, transformation during decompression: 0.021s
- regular zstd decompress: 0.24s

Example run for msg_sweep3d.dp using zstd at level 20:
- Transformation during compression: 0.083s, zstd compress on transformed data: 14.35s
- regular zstd compress: 183s
- zstd decompress from compressed transformed data: 0.075s, transformation during decompression: 0.022s
- regular zstd decompress: 0.31s

Even including the transformation, the transformed path at level 1 needs
0.086s + 0.08s = 0.166s to compress and 0.067s + 0.021s = 0.088s to
decompress, versus 0.34s and 0.24s for plain zstd. Here it is clear that the
transformed input is easier to parse (compress). Maybe the blocks are also of
a type which takes less time to decompress.

> If considering using existing libraries to provide any of the compression
> algorithms, license compatibility is also an important factor and therefore
> would be worth mentioning in Section 5.

This is something I forgot to list. I will get back to you and the other devs
with this information. The filter I proposed for lossless compression can be
integrated without any licensing concerns.

> Are any of the investigated strategies applicable to DECIMAL values?

The lossy compressors SZ and ZFP do not support that out of the box. I could
communicate with the SZ developers to come to a decision on how this could be
added to SZ. One option is to losslessly compress the pre-decimal digits and
lossily compress the post-decimal digits. For lossless compression, we could
apply a similar stream splitting technique to decimal types, though it might
be somewhat more complex and I have not really thought about this case.
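
To make the bs_zstd discussion a bit more concrete, below is a minimal sketch
of the stream splitting step (the function names and buffer layout are only
illustrative and may differ from what the report describes): the k-th byte of
every float is gathered into the k-th stream, the concatenated streams would
then be handed to zstd unchanged, and decompression reverses the shuffle.

#include <cstdint>
#include <cstring>
#include <vector>

// Scatter the bytes of each 32-bit float into per-byte streams.
// Stream k holds the k-th byte of every value; the streams are laid out
// back to back in the returned buffer, which would then be passed to zstd.
std::vector<uint8_t> SplitByteStreams(const float* values, size_t count) {
  std::vector<uint8_t> out(count * sizeof(float));
  for (size_t i = 0; i < count; ++i) {
    uint8_t bytes[sizeof(float)];
    std::memcpy(bytes, &values[i], sizeof(float));
    for (size_t k = 0; k < sizeof(float); ++k) {
      out[k * count + i] = bytes[k];
    }
  }
  return out;
}

// Inverse transformation, applied after zstd decompression.
std::vector<float> MergeByteStreams(const uint8_t* streams, size_t count) {
  std::vector<float> values(count);
  for (size_t i = 0; i < count; ++i) {
    uint8_t bytes[sizeof(float)];
    for (size_t k = 0; k < sizeof(float); ++k) {
      bytes[k] = streams[k * count + i];
    }
    std::memcpy(&values[i], bytes, sizeof(float));
  }
  return values;
}

In principle the same split/merge applies to any fixed-width representation,
which is why a similar technique could carry over to DECIMAL, though, as said
above, I have not looked into that case in detail.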

Regards,
Martin

________________________________
From: Zoltan Ivanfi <z...@cloudera.com>
Sent: Wednesday, July 3, 2019 6:07:50 PM
To: Parquet Dev; Radev, Martin
Cc: Raoofy, Amir; Karlstetter, Roman
Subject: Re: Floating point data compression for Apache Parquet

Hi Martin,

Thanks for the thorough investigation, very nice report. I would have a few
questions:

- Is data pre-loaded to RAM before making the measurements?

- In "Figure 3: Decompression speed in MB/s", is data size measured before or
  after decompression?

- In "Figure 4: Compression speed in MB/s", is data size measured before or
  after compression?

- According to "Figure 3: Decompression speed in MB/s", decompression of
  bs_zstd is almost twice as fast as plain zstd. Do you know what causes this
  massive speed improvement? Based on the description provided in section 3.2,
  bs_zstd uses the same zstd compression with an extra step of
  splitting/combining streams. Since this is extra work, I would have expected
  bs_zstd to be slower than pure zstd, unless the compressed data becomes so
  much smaller that it radically improves data access times. However,
  according to "Figure 2: Compression ratio", bs_zstd achieves "only" 23%
  better compression than plain zstd, which cannot be the reason for the 2x
  speed-up in itself.

- If considering using existing libraries to provide any of the compression
  algorithms, license compatibility is also an important factor and therefore
  would be worth mentioning in Section 5.

- Are any of the investigated strategies applicable to DECIMAL values? Since
  floating point values and calculations have an inherent inaccuracy, the
  DECIMAL type is much more important for storing financial data, which is
  one of the main use cases of Parquet.

Thanks,

Zoltan

On Mon, Jul 1, 2019 at 10:57 PM Radev, Martin <martin.ra...@tum.de<mailto:martin.ra...@tum.de>> wrote:

Hello folks,

thank you for your input. I am finished with my investigation regarding
introducing special support for FP compression in Apache Parquet. My report
also includes an investigation of lossy compressors, though there are still
some things to be cleared up.

Report: https://drive.google.com/open?id=1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv

Sections 3, 4, 5 and 6 are the most important ones to go over.

Let me know if you have any questions or concerns.

Regards,
Martin

________________________________
From: Zoltan Ivanfi <z...@cloudera.com.INVALID>
Sent: Thursday, June 13, 2019 2:16:56 PM
To: Parquet Dev
Cc: Raoofy, Amir; Karlstetter, Roman
Subject: Re: Floating point data compression for Apache Parquet

Hi Martin,

Thanks for your interest in improving Parquet. Efficient encodings are really
important in a big data file format, so this topic is definitely worth
researching and personally I am looking forward to your report. Whether to
add any new encodings to Parquet, however, cannot be answered until we see
the results of your findings.

You mention two paths. One has very small computational overhead but does not
provide significant space savings. The other provides significant space
savings, but at the price of a significant computational overhead.

While purely based on these properties both of them seem "balanced" (one is
small effort, small gain; the other is large effort, large gain) and therefore
sound like reasonable options, I would argue that one should also consider
development costs, code complexity and compatibility implications when
deciding whether a new feature is worth implementing.

Adding a new encoding or compression to Parquet complicates the specification
of the file format and requires implementing it in every language binding of
the format, which is not only a considerable effort, but is also error-prone
(see LZ4 for an example, which was added to both the Java and the C++
implementation of Parquet, yet they are incompatible with each other). And
lack of support is not only a minor annoyance in this case: if one is forced
to use an older reader that does not support the new encoding yet (or a
language binding that does not support it at all), the data simply cannot be
read.

In my opinion, no matter how low the computational overhead of a new encoding
is, if it does not provide significant gains, then the specification clutter,
implementation costs and the potential for compatibility problems greatly
outweigh its advantages. For this reason, I would say that only encodings that
provide significant gains are worth adding. As far as I am concerned, such a
new encoding would be a welcome addition to Parquet.

Thanks,

Zoltan

On Wed, Jun 12, 2019 at 11:10 PM Radev, Martin <martin.ra...@tum.de<mailto:martin.ra...@tum.de>> wrote:
>
> Dear all,
>
> thank you for your work on the Apache Parquet format.
>
> We are a group of students at the Technical University of Munich who would
> like to extend the available compression and encoding options for 32-bit and
> 64-bit floating point data in Apache Parquet.
> The current encodings and compression algorithms offered in Apache Parquet
> are heavily specialized towards integer and text data.
> Thus there is an opportunity to reduce both IO throughput requirements and
> space requirements for handling floating point data by selecting a
> specialized compression algorithm.
>
> Currently, I am doing an investigation of the available literature and
> publicly available FP compressors. As part of this investigation I am
> writing a report on my findings - the available algorithms, their strengths
> and weaknesses, compression rates, compression speeds and decompression
> speeds, and licenses.
> Once finished, I will share the report with you and make a proposal for
> which ones IMO are good candidates for Apache Parquet.
>
> The goal is to add a solution for both 32-bit and 64-bit FP types. I think
> that it would be beneficial to offer at the very least two distinct paths.
> The first one should offer fast compression and decompression speed with
> some, but not significant, savings in space. The second one should offer
> slower compression and decompression speed but a decent compression rate.
> Both lossless. A lossy path will be investigated further and discussed with
> the community.
>
> If I get an approval from you – the developers – I can continue with adding
> support for the new encoding/compression options in the C++ implementation
> of Apache Parquet in Apache Arrow.
>
> Please let me know what you think of this idea and whether you have any
> concerns with the plan.
>
> Best regards,
> Martin Radev