Hi Martin,

Thanks for the explanations, makes sense. Nice work!
Br, Zoltan

On Thu, Jul 4, 2019 at 12:22 AM Radev, Martin <[email protected]> wrote:
> Hello Zoltan,
>
> > Is data pre-loaded to RAM before making the measurements?
> Yes, the file is read into physical memory.
>
> For mmap-ed files read from external storage I would expect, though I am
> not 100% sure, that the I/O overhead would be big enough that all
> algorithms end up compressing at roughly the same speed.
>
> > In "Figure 3: Decompression speed in MB/s", is data size measured
> > before or after decompression?
> > In "Figure 4: Compression speed in MB/s", is data size measured before
> > or after compression?
> For both, the reported result is "size of the original file / time to
> compress or decompress".
>
> > According to "Figure 3: Decompression speed in MB/s", decompression of
> > bs_zstd is almost twice as fast as plain zstd. Do you know what causes
> > this massive speed improvement?
> I do not know all of the details. As you mentioned, the written-out data
> is smaller; this could lead to an improvement in speed as less data has
> to be written to memory during compression or read from memory during
> decompression.
>
> Another thing to consider is that ZSTD uses different techniques to
> compress a block of data - "raw", "RLE", "Huffman coding", "Treeless
> coding".
>
> I expect that "Huffman coding" is more costly than "RLE", and I also
> expect "RLE" to be applicable to the majority of the sign bits, thus
> leading to a performance win when the transformation is applied.
>
> I also expect that zstd has to do some form of "optimal parsing" to
> decide how to process the input in order to compress it well. This is
> something every wanna-be-good LZ-like compressor has to do (
> https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html
> ,
> http://cbloomrants.blogspot.com/2011/10/10-24-11-lz-optimal-parse-with-star.html
> ).
> It might be that the transformed input is somehow easier, which leads to
> faster compression rates, and that the resulting data is also easier to
> decompress, which leads to faster decompression rates.
>
> I used this as a reference:
> https://www.rfc-editor.org/rfc/pdfrfc/rfc8478.txt.pdf. I am not familiar
> with ZSTD in particular.
>
> I also checked that the majority of the time is spent in zstd.
>
> Example run for msg_sweep3d.dp using zstd at level 1:
> - Transformation during compression: 0.086s, ZSTD compress on
>   transformed data: 0.08s
> - regular ZSTD compress: 0.34s
> - ZSTD decompress of the compressed transformed data: 0.067s,
>   transformation during decompression: 0.021s
> - regular ZSTD decompress: 0.24s
>
> Example run for msg_sweep3d.dp using zstd at level 20:
> - Transformation during compression: 0.083s, ZSTD compress on
>   transformed data: 14.35s
> - regular ZSTD compress: 183s
> - ZSTD decompress of the compressed transformed data: 0.075s,
>   transformation during decompression: 0.022s
> - regular ZSTD decompress: 0.31s
>
> Here it is clear that the transformed input is easier to parse
> (compress). Maybe the blocks are also of a type which takes less time to
> decompress.
>
> > If considering using existing libraries to provide any of the
> > compression algorithms, license compatibility is also an important
> > factor and therefore would be worth mentioning in Section 5.
> This is something I forgot to list. I will get back to you and the other
> devs with this information.
>
> The filter I proposed for lossless compression can be integrated without
> any license concerns.
>
> > Are any of the investigated strategies applicable to DECIMAL values?
> The lossy compressors SZ and ZFP do not support that out of the box. I
> could communicate with the SZ developers to come to a decision on how
> this could be added to SZ. An option is to losslessly compress the
> pre-decimal digits and lossily compress the post-decimal digits.
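To make the transformation being timed above concrete, here is a minimal
sketch of byte-stream splitting for 64-bit floats (illustrative Python only,
not the report's actual implementation; the helper names `split_streams` and
`merge_streams` are made up). Byte k of every value is scattered into stream
k, so the highly repetitive sign/exponent bytes become long contiguous runs
that zstd can encode as cheap RLE blocks:

```python
import struct

def split_streams(doubles):
    """Scatter byte k of each IEEE-754 double into stream k (8 streams)."""
    raw = struct.pack(f"<{len(doubles)}d", *doubles)
    return [raw[k::8] for k in range(8)]

def merge_streams(streams):
    """Inverse transformation: interleave the 8 streams back into doubles."""
    n = len(streams[0])
    raw = bytearray(8 * n)
    for k, s in enumerate(streams):
        raw[k::8] = s
    return list(struct.unpack(f"<{n}d", bytes(raw)))

values = [1.0, 2.0, 3.0, 4.0]
streams = split_streams(values)
assert merge_streams(streams) == values
# The high-order streams (exponent/sign bytes) now hold runs of nearly
# identical bytes, which compress far better than the interleaved layout.
```

Compressing the concatenated streams with zstd on write and reversing the
scatter on read is essentially all that bs_zstd adds on top of plain zstd.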
> For lossless compression, we can apply a similar stream-splitting
> technique to decimal types, though it might be somewhat more complex and
> I have not really thought about this case.
>
> Regards,
> Martin
> ------------------------------
> From: Zoltan Ivanfi <[email protected]>
> Sent: Wednesday, July 3, 2019 6:07:50 PM
> To: Parquet Dev; Radev, Martin
> Cc: Raoofy, Amir; Karlstetter, Roman
> Subject: Re: Floating point data compression for Apache Parquet
>
> Hi Martin,
>
> Thanks for the thorough investigation, very nice report. I would have a
> few questions:
>
> - Is data pre-loaded to RAM before making the measurements?
>
> - In "Figure 3: Decompression speed in MB/s", is data size measured
>   before or after decompression?
>
> - In "Figure 4: Compression speed in MB/s", is data size measured before
>   or after compression?
>
> - According to "Figure 3: Decompression speed in MB/s", decompression of
>   bs_zstd is almost twice as fast as plain zstd. Do you know what causes
>   this massive speed improvement? Based on the description provided in
>   section 3.2, bs_zstd uses the same zstd compression with an extra step
>   of splitting/combining streams. Since this is extra work, I would have
>   expected bs_zstd to be slower than pure zstd, unless the compressed
>   data becomes so much smaller that it radically improves data access
>   times. However, according to "Figure 2: Compression ratio", bs_zstd
>   achieves "only" 23% better compression than plain zstd, which can not
>   be the reason for the 2x speed-up in itself.
>
> - If considering using existing libraries to provide any of the
>   compression algorithms, license compatibility is also an important
>   factor and therefore would be worth mentioning in Section 5.
>
> - Are any of the investigated strategies applicable to DECIMAL values?
>   Since floating point values and calculations have an inherent
>   inaccuracy, the DECIMAL type is much more important for storing
>   financial data, which is one of the main use cases of Parquet.
>
> Thanks,
>
> Zoltan
>
> On Mon, Jul 1, 2019 at 10:57 PM Radev, Martin <[email protected]> wrote:
>
>> Hello folks,
>>
>> thank you for your input.
>>
>> I am finished with my investigation regarding introducing special
>> support for FP compression in Apache Parquet.
>>
>> My report also includes an investigation of lossy compressors, though
>> there are still some things to be cleared up.
>>
>> Report:
>> https://drive.google.com/open?id=1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv
>>
>> Sections 3, 4, 5 and 6 are the most important to go over.
>>
>> Let me know if you have any questions or concerns.
>>
>> Regards,
>> Martin
>>
>> ________________________________
>> From: Zoltan Ivanfi <[email protected]>
>> Sent: Thursday, June 13, 2019 2:16:56 PM
>> To: Parquet Dev
>> Cc: Raoofy, Amir; Karlstetter, Roman
>> Subject: Re: Floating point data compression for Apache Parquet
>>
>> Hi Martin,
>>
>> Thanks for your interest in improving Parquet. Efficient encodings are
>> really important in a big data file format, so this topic is definitely
>> worth researching, and personally I am looking forward to your report.
>> Whether to add any new encodings to Parquet, however, can not be
>> answered until we see the results of your findings.
>>
>> You mention two paths. One has very small computational overhead but
>> does not provide significant space savings. The other provides
>> significant space savings, but at the price of a significant
>> computational overhead.
>> While purely based on these properties both of them seem "balanced"
>> (one is small effort, small gain; the other is large effort, large
>> gain) and therefore sound like reasonable options, I would argue that
>> one should also consider development costs, code complexity and
>> compatibility implications when deciding whether a new feature is worth
>> implementing.
>>
>> Adding a new encoding or compression to Parquet complicates the
>> specification of the file format and requires implementing it in every
>> language binding of the format, which is not only a considerable effort
>> but is also error-prone (see LZ4 for an example, which was added to
>> both the Java and the C++ implementation of Parquet, yet they are
>> incompatible with each other). And lack of support is not only a minor
>> annoyance in this case: if one is forced to use an older reader that
>> does not support the new encoding yet (or a language binding that does
>> not support it at all), the data simply can not be read.
>>
>> In my opinion, no matter how low the computational overhead of a new
>> encoding is, if it does not provide significant gains, then the
>> specification clutter, implementation costs and the potential for
>> compatibility problems greatly outweigh its advantages. For this
>> reason, I would say that only encodings that provide significant gains
>> are worth adding. As far as I am concerned, such a new encoding would
>> be a welcome addition to Parquet.
>>
>> Thanks,
>>
>> Zoltan
>>
>> On Wed, Jun 12, 2019 at 11:10 PM Radev, Martin <[email protected]>
>> wrote:
>> >
>> > Dear all,
>> >
>> > thank you for your work on the Apache Parquet format.
>> >
>> > We are a group of students at the Technical University of Munich who
>> > would like to extend the available compression and encoding options
>> > for 32-bit and 64-bit floating point data in Apache Parquet.
>> > The current encodings and compression algorithms offered in Apache
>> > Parquet are heavily specialized towards integer and text data. Thus
>> > there is an opportunity to reduce both the I/O throughput
>> > requirements and the space requirements for handling floating point
>> > data by selecting a specialized compression algorithm.
>> >
>> > Currently, I am doing an investigation of the available literature
>> > and publicly available FP compressors. As part of my investigation I
>> > am writing a report on my findings - the available algorithms, their
>> > strengths and weaknesses, compression ratios, compression speeds,
>> > decompression speeds, and licenses. Once finished, I will share the
>> > report with you and make a proposal for which ones are, in my
>> > opinion, good candidates for Apache Parquet.
>> >
>> > The goal is to add a solution for both 32-bit and 64-bit FP types. I
>> > think that it would be beneficial to offer at the very least two
>> > distinct paths. The first one should offer fast compression and
>> > decompression speed with some, but not significant, savings in
>> > space. The second one should offer slower compression and
>> > decompression speed but with a decent compression ratio. Both
>> > lossless. A lossy path will be investigated further and discussed
>> > with the community.
>> >
>> > If I get an approval from you – the developers – I can continue with
>> > adding support for the new encoding/compression options in the C++
>> > implementation of Apache Parquet in Apache Arrow.
>> >
>> > Please let me know what you think of this idea and whether you have
>> > any concerns with the plan.
>> >
>> > Best regards,
>> > Martin Radev
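As a rough illustration of the "similar stream splitting technique for
decimal types" mentioned earlier in the thread, the same per-byte scatter can
be applied to DECIMAL columns, assuming the values are stored as fixed-width
big-endian unscaled integers (as Parquet's FIXED_LEN_BYTE_ARRAY
representation of DECIMAL does). This is only a sketch; the function name is
made up:

```python
def split_decimal_streams(unscaled, width=8):
    """Scatter byte k of each fixed-width unscaled integer into stream k."""
    raw = b"".join(v.to_bytes(width, "big", signed=True) for v in unscaled)
    return [raw[k::width] for k in range(width)]

# Unscaled values for a DECIMAL(7, 2) price column (i.e. cents).
cents = [19999, 20049, 20125, 19950]
streams = split_decimal_streams(cents)
# Values of similar magnitude share their high-order bytes, so the
# leading streams are runs of identical bytes, just like the FP
# sign/exponent streams in the bs_zstd case.
assert streams[0] == b"\x00" * 4
```

The noisy low-order digits end up isolated in the last streams, while the
leading streams compress extremely well, which mirrors the behavior observed
for floating point data.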
