Hi Martin,

Thanks for the explanations, makes sense. Nice work!
Br, Zoltan

On Thu, Jul 4, 2019 at 12:22 AM Radev, Martin <[email protected]> wrote:
> Hello Zoltan,
>
> > Is data pre-loaded to RAM before making the measurements?
> Yes, the file is read into physical memory.
>
> For mmap-ed files read from external storage I would expect, though I am
> not 100% sure, that the I/O overhead would be big enough that all
> algorithms end up compressing at roughly the same speed.
>
> > In "Figure 3: Decompression speed in MB/s", is data size measured
> > before or after decompression?
> > In "Figure 4: Compression speed in MB/s", is data size measured before
> > or after compression?
> For both, the reported result is "size of the original file / time to
> compress or decompress".
>
> > According to "Figure 3: Decompression speed in MB/s", decompression of
> > bs_zstd is almost twice as fast as plain zstd. Do you know what causes
> > this massive speed improvement?
> I do not know all of the details. As you mentioned, the written-out data
> is smaller; this could lead to an improvement in speed as less data has
> to be written to memory during compression or read from memory during
> decompression.
>
> Another thing to consider is that ZSTD uses different techniques to
> compress a block of data - "raw", "RLE", "Huffman coding", "Treeless
> coding".
>
> I expect that "Huffman coding" is more costly than "RLE", and I also
> expect "RLE" to be applicable to the majority of the sign bits, thus
> leading to a performance win when the transformation is applied.
>
> I also expect that zstd has to do some form of "optimal parsing" to
> decide how to process the input in order to compress it well. This is
> something every wanna-be-good LZ-like compressor has to do (
> https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html
> ,
> http://cbloomrants.blogspot.com/2011/10/10-24-11-lz-optimal-parse-with-star.html
> ).
> It might be that the transformed input is somehow easier, which leads to
> faster compression rates, and that the resulting data is also easier to
> decompress, which leads to faster decompression rates.
>
> I used this as a reference:
> https://www.rfc-editor.org/rfc/pdfrfc/rfc8478.txt.pdf. I am not familiar
> with ZSTD in particular.
>
> I also checked that the majority of the time is spent in zstd.
>
> Example run for msg_sweep3d.dp using zstd at level 1:
> - Transformation during compression: 0.086s, ZSTD compress on
>   transformed data: 0.08s
> - regular ZSTD compress: 0.34s
> - ZSTD decompress of the compressed transformed data: 0.067s,
>   transformation during decompression: 0.021s
> - regular ZSTD decompress: 0.24s
>
> Example run for msg_sweep3d.dp using zstd at level 20:
> - Transformation during compression: 0.083s, ZSTD compress on
>   transformed data: 14.35s
> - regular ZSTD compress: 183s
> - ZSTD decompress of the compressed transformed data: 0.075s,
>   transformation during decompression: 0.022s
> - regular ZSTD decompress: 0.31s
>
> Here it is clear that the transformed input is easier to parse
> (compress). Maybe the blocks are also of a type which takes less time to
> decompress.
>
> > If considering using existing libraries to provide any of the
> > compression algorithms, license compatibility is also an important
> > factor and therefore would be worth mentioning in Section 5.
> This is something I forgot to list. I will get back to you and the other
> devs with this information.
>
> The filter I proposed for lossless compression can be integrated without
> any license concerns.
>
> > Are any of the investigated strategies applicable to DECIMAL values?
> The lossy compressors SZ and ZFP do not support that out of the box. I
> could communicate with the SZ developers to come to a decision on how
> this could be added to SZ. An option is to losslessly compress the
> pre-decimal digits and lossily compress the post-decimal digits.
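To make the transformation being timed above concrete, here is a minimal
sketch of byte-stream splitting for 64-bit floats (illustrative Python only,
not the report's actual implementation; the helper names `split_streams` and
`merge_streams` are made up). Byte k of every value is scattered into stream
k, so the highly repetitive sign/exponent bytes become long contiguous runs
that zstd can encode as cheap RLE blocks:

```python
import struct

def split_streams(doubles):
    """Scatter byte k of each IEEE-754 double into stream k (8 streams)."""
    raw = struct.pack(f"<{len(doubles)}d", *doubles)
    return [raw[k::8] for k in range(8)]

def merge_streams(streams):
    """Inverse transformation: interleave the 8 streams back into doubles."""
    n = len(streams[0])
    raw = bytearray(8 * n)
    for k, s in enumerate(streams):
        raw[k::8] = s
    return list(struct.unpack(f"<{n}d", bytes(raw)))

values = [1.0, 2.0, 3.0, 4.0]
streams = split_streams(values)
assert merge_streams(streams) == values
# The high-order streams (exponent/sign bytes) now hold runs of nearly
# identical bytes, which compress far better than the interleaved layout.
```

Compressing the concatenated streams with zstd on write and reversing the
scatter on read is essentially all that bs_zstd adds on top of plain zstd.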
> For lossless compression, we can apply a similar stream-splitting
> technique to decimal types, though it might be somewhat more complex and
> I have not really thought about this case.
>
> Regards,
> Martin
> ------------------------------
> From: Zoltan Ivanfi <[email protected]>
> Sent: Wednesday, July 3, 2019 6:07:50 PM
> To: Parquet Dev; Radev, Martin
> Cc: Raoofy, Amir; Karlstetter, Roman
> Subject: Re: Floating point data compression for Apache Parquet
>
> Hi Martin,
>
> Thanks for the thorough investigation, very nice report. I would have a
> few questions:
>
> - Is data pre-loaded to RAM before making the measurements?
>
> - In "Figure 3: Decompression speed in MB/s", is data size measured
>   before or after decompression?
>
> - In "Figure 4: Compression speed in MB/s", is data size measured before
>   or after compression?
>
> - According to "Figure 3: Decompression speed in MB/s", decompression of
>   bs_zstd is almost twice as fast as plain zstd. Do you know what causes
>   this massive speed improvement? Based on the description provided in
>   section 3.2, bs_zstd uses the same zstd compression with an extra step
>   of splitting/combining streams. Since this is extra work, I would have
>   expected bs_zstd to be slower than pure zstd, unless the compressed
>   data becomes so much smaller that it radically improves data access
>   times. However, according to "Figure 2: Compression ratio", bs_zstd
>   achieves "only" 23% better compression than plain zstd, which can not
>   be the reason for the 2x speed-up in itself.
>
> - If considering using existing libraries to provide any of the
>   compression algorithms, license compatibility is also an important
>   factor and therefore would be worth mentioning in Section 5.
>
> - Are any of the investigated strategies applicable to DECIMAL values?
>   Since floating point values and calculations have an inherent
>   inaccuracy, the DECIMAL type is much more important for storing
>   financial data, which is one of the main use cases of Parquet.
>
> Thanks,
>
> Zoltan
>
> On Mon, Jul 1, 2019 at 10:57 PM Radev, Martin <[email protected]> wrote:
>
>> Hello folks,
>>
>> thank you for your input.
>>
>> I am finished with my investigation regarding introducing special
>> support for FP compression in Apache Parquet.
>>
>> My report also includes an investigation of lossy compressors, though
>> there are still some things to be cleared up.
>>
>> Report:
>> https://drive.google.com/open?id=1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv
>>
>> Sections 3, 4, 5 and 6 are the most important to go over.
>>
>> Let me know if you have any questions or concerns.
>>
>> Regards,
>> Martin
>>
>> ________________________________
>> From: Zoltan Ivanfi <[email protected]>
>> Sent: Thursday, June 13, 2019 2:16:56 PM
>> To: Parquet Dev
>> Cc: Raoofy, Amir; Karlstetter, Roman
>> Subject: Re: Floating point data compression for Apache Parquet
>>
>> Hi Martin,
>>
>> Thanks for your interest in improving Parquet. Efficient encodings are
>> really important in a big data file format, so this topic is definitely
>> worth researching, and personally I am looking forward to your report.
>> Whether to add any new encodings to Parquet, however, can not be
>> answered until we see the results of your findings.
>>
>> You mention two paths. One has very small computational overhead but
>> does not provide significant space savings. The other provides
>> significant space savings, but at the price of a significant
>> computational overhead.
>> While purely based on these properties both of them seem "balanced"
>> (one is small effort, small gain; the other is large effort, large
>> gain) and therefore sound like reasonable options, I would argue that
>> one should also consider development costs, code complexity and
>> compatibility implications when deciding whether a new feature is worth
>> implementing.
>>
>> Adding a new encoding or compression to Parquet complicates the
>> specification of the file format and requires implementing it in every
>> language binding of the format, which is not only a considerable effort
>> but is also error-prone (see LZ4 for an example, which was added to
>> both the Java and the C++ implementation of Parquet, yet they are
>> incompatible with each other). And lack of support is not only a minor
>> annoyance in this case: if one is forced to use an older reader that
>> does not support the new encoding yet (or a language binding that does
>> not support it at all), the data simply can not be read.
>>
>> In my opinion, no matter how low the computational overhead of a new
>> encoding is, if it does not provide significant gains, then the
>> specification clutter, implementation costs and the potential for
>> compatibility problems greatly outweigh its advantages. For this
>> reason, I would say that only encodings that provide significant gains
>> are worth adding. As far as I am concerned, such a new encoding would
>> be a welcome addition to Parquet.
>>
>> Thanks,
>>
>> Zoltan
>>
>> On Wed, Jun 12, 2019 at 11:10 PM Radev, Martin <[email protected]>
>> wrote:
>> >
>> > Dear all,
>> >
>> > thank you for your work on the Apache Parquet format.
>> >
>> > We are a group of students at the Technical University of Munich who
>> > would like to extend the available compression and encoding options
>> > for 32-bit and 64-bit floating point data in Apache Parquet.
>> > The current encodings and compression algorithms offered in Apache
>> > Parquet are heavily specialized towards integer and text data. Thus
>> > there is an opportunity to reduce both the I/O throughput
>> > requirements and the space requirements for handling floating point
>> > data by selecting a specialized compression algorithm.
>> >
>> > Currently, I am doing an investigation of the available literature
>> > and publicly available FP compressors. As part of my investigation I
>> > am writing a report on my findings - the available algorithms, their
>> > strengths and weaknesses, compression ratios, compression speeds,
>> > decompression speeds, and licenses. Once finished, I will share the
>> > report with you and make a proposal for which ones are, in my
>> > opinion, good candidates for Apache Parquet.
>> >
>> > The goal is to add a solution for both 32-bit and 64-bit FP types. I
>> > think that it would be beneficial to offer at the very least two
>> > distinct paths. The first one should offer fast compression and
>> > decompression speed with some, but not significant, savings in
>> > space. The second one should offer slower compression and
>> > decompression speed but with a decent compression ratio. Both
>> > lossless. A lossy path will be investigated further and discussed
>> > with the community.
>> >
>> > If I get an approval from you – the developers – I can continue with
>> > adding support for the new encoding/compression options in the C++
>> > implementation of Apache Parquet in Apache Arrow.
>> >
>> > Please let me know what you think of this idea and whether you have
>> > any concerns with the plan.
>> >
>> > Best regards,
>> > Martin Radev
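As a rough illustration of the "similar stream splitting technique for
decimal types" mentioned earlier in the thread, the same per-byte scatter can
be applied to DECIMAL columns, assuming the values are stored as fixed-width
big-endian unscaled integers (as Parquet's FIXED_LEN_BYTE_ARRAY
representation of DECIMAL does). This is only a sketch; the function name is
made up:

```python
def split_decimal_streams(unscaled, width=8):
    """Scatter byte k of each fixed-width unscaled integer into stream k."""
    raw = b"".join(v.to_bytes(width, "big", signed=True) for v in unscaled)
    return [raw[k::width] for k in range(width)]

# Unscaled values for a DECIMAL(7, 2) price column (i.e. cents).
cents = [19999, 20049, 20125, 19950]
streams = split_decimal_streams(cents)
# Values of similar magnitude share their high-order bytes, so the
# leading streams are runs of identical bytes, just like the FP
# sign/exponent streams in the bs_zstd case.
assert streams[0] == b"\x00" * 4
```

The noisy low-order digits end up isolated in the last streams, while the
leading streams compress extremely well, which mirrors the behavior observed
for floating point data.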
