Hello Martin, I'm willing to run some tests at scale on our genomics data when a parquet-mr pull request for the Java implementation is ready.
Cheers,
Michael

> On Jul 11, 2019, at 1:09 PM, Radev, Martin <[email protected]> wrote:
>
> Dear all,
>
> I created a Jira issue for the new feature and also made a pull request
> for my patch, which extends the format and documentation.
>
> Jira issue: https://issues.apache.org/jira/browse/PARQUET-1622
> Pull request: https://github.com/apache/parquet-format/pull/144
>
> I also have a WIP patch for adding the "BYTE_STREAM_SPLIT" encoding to
> parquet-cpp within Apache Arrow.
>
> How should we proceed?
> It would be great to get feedback from other community members.
>
> Regards,
> Martin
>
> ________________________________
> From: Radev, Martin <[email protected]>
> Sent: Tuesday, July 9, 2019 1:01:25 AM
> To: Zoltan Ivanfi
> Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> Subject: Re: Floating point data compression for Apache Parquet
>
> Hello Zoltan,
>
> I can provide a C++ and Java implementation for the encoder.
> The encoder/decoder is very small, and naturally I have to add tests.
> I expect the biggest hurdle will be setting up the environment and
> reading through the developer guides.
>
> I will write my patches for Apache Arrow and for Apache Parquet and
> send them for review. After getting them in, I can continue with the
> Java implementation. Let me know if you have any concerns.
>
> It would be great to get an opinion from other Parquet contributors : )
>
> Thank you for the feedback!
>
> Best regards,
> Martin
>
> ________________________________
> From: Zoltan Ivanfi <[email protected]>
> Sent: Monday, July 8, 2019 5:06:30 PM
> To: Radev, Martin
> Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> Subject: Re: Floating point data compression for Apache Parquet
>
> Hi Martin,
>
> I agree that bs_zstd would be a good place to start.
> Regarding the choice of language, Java, C++ and Python are your
> options. As far as I know, the Java implementation of Parquet has more
> users from the business sector, where decimal is preferred over
> floating point data types. It is also much more tightly integrated
> with the Hadoop ecosystem (it is even called parquet-mr, as in
> MapReduce), making for a steeper learning curve.
>
> The Python and C++ language bindings have more scientific users, so
> users of these may be more interested in the new encodings. Python is
> a good language for rapid prototyping as well, but the Python binding
> of Parquet may use the C++ library under the hood, I'm not sure (I'm
> more familiar with the Java implementation). In any case, there are at
> least two Python bindings: pyarrow and fastparquet.
>
> I think we can extend the format before the actual implementations are
> ready, provided that the specification is clear and nobody objects to
> adding it to the format. For this, I would wait for the opinion of a
> few more Parquet developers first, since changes to the format that
> are only supported by a single committer usually have a hard time
> getting into the spec. Additionally, could you please clarify which
> language bindings you plan to implement yourself? This will help the
> developers of the different language bindings assess how much work
> they will have to do to add support.
>
> Thanks,
> Zoltan
>
> On Fri, Jul 5, 2019 at 4:34 PM Radev, Martin <[email protected]> wrote:
>
> Hello Zoltan and Parquet devs,
>
> Do you think it would be appropriate to start with a Parquet prototype
> from my side? I suspect that 'bs_zstd' would be the simplest to
> integrate, and from the report we can see an improvement in both ratio
> and speed.
>
> Do you think that Apache Arrow is an appropriate place to prototype
> the extension of the format?
> Do you agree that the enum field 'Encodings' is a suitable place to
> add the 'Byte stream-splitting transformation'? That way it could be
> used with any of the other supported compressors.
>
> It might be best to also add a Java implementation of the
> transformation. Would the project 'parquet-mr' be a good place?
>
> Would the workflow be such that I write my patches, we verify
> correctness, get reviews, merge them, and only then make adjustments
> to the Apache Parquet spec?
>
> Any piece of advice is welcome!
>
> Regards,
> Martin
>
> ________________________________
> From: Zoltan Ivanfi <[email protected]>
> Sent: Friday, July 5, 2019 4:21:39 PM
> To: Radev, Martin
> Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> Subject: Re: Floating point data compression for Apache Parquet
>
> Hi Martin,
>
> Thanks for the explanations, makes sense. Nice work!
>
> Br,
> Zoltan
>
> On Thu, Jul 4, 2019 at 12:22 AM Radev, Martin <[email protected]> wrote:
>
> Hello Zoltan,
>
> >> Is data pre-loaded to RAM before making the measurements?
> Yes, the file is read into physical memory.
>
> For mmap-ed files read from external storage I would expect, but am
> not 100% sure, that the IO overhead would be big enough that all
> algorithms compress at close to the same speed.
>
> >> In "Figure 3: Decompression speed in MB/s", is data size measured
> >> before or after uncompression?
> >> In "Figure 4: Compression speed in MB/s", is data size measured
> >> before or after compression?
> For both, the reported result is "size of the original file / time to
> compress or decompress".
>
> >> According to "Figure 3: Decompression speed in MB/s", decompression
> >> of bs_zstd is almost twice as fast as plain zstd. Do you know what
> >> causes this massive speed improvement?
>
> I do not know all of the details.
> As you mentioned, less data is written out, which could lead to an
> improvement in speed: less data has to be written to memory during
> compression or read from memory during decompression.
>
> Another thing to consider is that ZSTD uses different techniques to
> compress a block of data: "raw", "RLE", "Huffman coding" and "treeless
> coding". I expect that "Huffman coding" is more costly than "RLE", and
> I also expect "RLE" to be applicable to the majority of the sign bits,
> leading to a performance win when the transformation is applied.
>
> I also expect that zstd has to do some form of "optimal parsing" to
> decide how to process the input in order to compress it well. This is
> something every wanna-be-good LZ-like compressor has to do (
> https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html
> ,
> http://cbloomrants.blogspot.com/2011/10/10-24-11-lz-optimal-parse-with-star.html
> ). It might be that the transformed input is somehow easier, which
> leads to faster compression, and that the resulting data is easier to
> decompress, which leads to faster decompression.
>
> I used this as a reference:
> https://www.rfc-editor.org/rfc/pdfrfc/rfc8478.txt.pdf. I am not
> familiar with ZSTD in particular.
>
> I also checked that the majority of the time is spent in zstd.
>
> Example run for msg_sweep3d.dp using zstd at level 1:
> - Transformation during compression: 0.086s; ZSTD compress on
>   transformed data: 0.08s
> - Regular ZSTD compress: 0.34s
> - ZSTD decompress of compressed transformed data: 0.067s;
>   transformation during decompression: 0.021s
> - Regular ZSTD decompress: 0.24s
>
> Example run for msg_sweep3d.dp using zstd at level 20:
>
> - Transformation during compression: 0.083s; ZSTD compress on
>   transformed data: 14.35s
> - Regular ZSTD compress: 183s
> - ZSTD decompress of compressed transformed data: 0.075s;
>   transformation during decompression: 0.022s
> - Regular ZSTD decompress: 0.31s
>
> Here it is clear that the transformed input is easier to parse
> (compress). Maybe the blocks are also of a type which takes less time
> to decompress.
>
> >> If considering using existing libraries to provide any of the
> >> compression algorithms, license compatibility is also an important
> >> factor and therefore would be worth mentioning in Section 5.
> This is something I forgot to list. I will get back to you and the
> other devs with information. The filter I proposed for lossless
> compression can be integrated without any license concerns.
>
> >> Are any of the investigated strategies applicable to DECIMAL values?
> The lossy compressors SZ and ZFP do not support that out of the box. I
> could communicate with the SZ developers to decide how this could be
> added to SZ. An option is to losslessly compress the pre-decimal part
> of the number and lossily compress the post-decimal part.
>
> For lossless compression, we can apply a similar stream-splitting
> technique to decimal types, though it might be somewhat more complex
> and I have not really thought about this case.
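[The level-1 timings quoted above already account for the roughly 2x figure: the transformation itself is cheap, and zstd proper runs about twice as fast on the split streams. A quick back-of-the-envelope check, using only the numbers reported in this message:]

```python
# Speed-up implied by the reported level-1 timings for msg_sweep3d.dp.
# All numbers are copied from the measurements quoted above (seconds).
plain_compress = 0.34
plain_decompress = 0.24

split_compress = 0.086 + 0.08     # transform + zstd on transformed data
split_decompress = 0.067 + 0.021  # zstd + inverse transform

compress_speedup = plain_compress / split_compress
decompress_speedup = plain_decompress / split_decompress

print(f"compress: {compress_speedup:.2f}x, decompress: {decompress_speedup:.2f}x")
# → compress: 2.05x, decompress: 2.73x
```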
>
> Regards,
> Martin
>
> ________________________________
> From: Zoltan Ivanfi <[email protected]>
> Sent: Wednesday, July 3, 2019 6:07:50 PM
> To: Parquet Dev; Radev, Martin
> Cc: Raoofy, Amir; Karlstetter, Roman
> Subject: Re: Floating point data compression for Apache Parquet
>
> Hi Martin,
>
> Thanks for the thorough investigation, very nice report. I would have
> a few questions:
>
> - Is data pre-loaded to RAM before making the measurements?
>
> - In "Figure 3: Decompression speed in MB/s", is data size measured
>   before or after uncompression?
>
> - In "Figure 4: Compression speed in MB/s", is data size measured
>   before or after compression?
>
> - According to "Figure 3: Decompression speed in MB/s", decompression
>   of bs_zstd is almost twice as fast as plain zstd. Do you know what
>   causes this massive speed improvement? Based on the description
>   provided in section 3.2, bs_zstd uses the same zstd compression with
>   an extra step of splitting/combining streams. Since this is extra
>   work, I would have expected bs_zstd to be slower than pure zstd,
>   unless the compressed data becomes so much smaller that it radically
>   improves data access times. However, according to "Figure 2:
>   Compression ratio", bs_zstd achieves "only" 23% better compression
>   than plain zstd, which can not be the reason for the 2x speed-up in
>   itself.
>
> - If considering using existing libraries to provide any of the
>   compression algorithms, license compatibility is also an important
>   factor and therefore would be worth mentioning in Section 5.
>
> - Are any of the investigated strategies applicable to DECIMAL values?
>   Since floating point values and calculations have an inherent
>   inaccuracy, the DECIMAL type is much more important for storing
>   financial data, which is one of the main use cases of Parquet.
>
> Thanks,
> Zoltan
>
> On Mon, Jul 1, 2019 at 10:57 PM Radev, Martin <[email protected]> wrote:
>
> Hello folks,
>
> Thank you for your input. I am finished with my investigation into
> introducing special support for FP compression in Apache Parquet. My
> report also includes an investigation of lossy compressors, though
> there are still some things to be cleared up.
>
> Report: https://drive.google.com/open?id=1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv
>
> Sections 3, 4, 5 and 6 are the most important to go over.
>
> Let me know if you have any questions or concerns.
>
> Regards,
> Martin
>
> ________________________________
> From: Zoltan Ivanfi <[email protected]>
> Sent: Thursday, June 13, 2019 2:16:56 PM
> To: Parquet Dev
> Cc: Raoofy, Amir; Karlstetter, Roman
> Subject: Re: Floating point data compression for Apache Parquet
>
> Hi Martin,
>
> Thanks for your interest in improving Parquet. Efficient encodings are
> really important in a big data file format, so this topic is
> definitely worth researching, and personally I am looking forward to
> your report. Whether to add any new encodings to Parquet, however,
> cannot be answered until we see your findings.
>
> You mention two paths. One has very small computational overhead but
> does not provide significant space savings. The other provides
> significant space savings, but at the price of significant
> computational overhead. While purely based on these properties both
> seem "balanced" (one is small effort, small gain; the other is large
> effort, large gain) and therefore sound like reasonable options, I
> would argue that one should also consider development costs, code
> complexity and compatibility implications when deciding whether a new
> feature is worth implementing.
>
> Adding a new encoding or compression to Parquet complicates the
> specification of the file format and requires implementing it in every
> language binding of the format, which is not only a considerable
> effort, but is also error-prone (see LZ4 for an example, which was
> added to both the Java and the C++ implementation of Parquet, yet the
> two are incompatible with each other). And lack of support is not only
> a minor annoyance in this case: if one is forced to use an older
> reader that does not support the new encoding yet (or a language
> binding that does not support it at all), the data simply cannot be
> read.
>
> In my opinion, no matter how low the computational overhead of a new
> encoding is, if it does not provide significant gains, then the
> specification clutter, implementation costs and the potential for
> compatibility problems greatly outweigh its advantages. For this
> reason, I would say that only encodings that provide significant gains
> are worth adding. As far as I am concerned, such a new encoding would
> be a welcome addition to Parquet.
>
> Thanks,
> Zoltan
>
> On Wed, Jun 12, 2019 at 11:10 PM Radev, Martin <[email protected]> wrote:
>>
>> Dear all,
>>
>> Thank you for your work on the Apache Parquet format.
>>
>> We are a group of students at the Technical University of Munich who
>> would like to extend the available compression and encoding options
>> for 32-bit and 64-bit floating point data in Apache Parquet. The
>> current encodings and compression algorithms offered in Apache
>> Parquet are heavily specialized towards integer and text data. Thus
>> there is an opportunity to reduce both IO throughput requirements and
>> space requirements for handling floating point data by selecting a
>> specialized compression algorithm.
>>
>> Currently, I am doing an investigation of the available literature
>> and publicly available FP compressors.
>> As part of my investigation I am writing a report on my findings:
>> the available algorithms, their strengths and weaknesses, compression
>> ratios, compression and decompression speeds, and licenses. Once
>> finished, I will share the report with you and make a proposal for
>> which ones are, in my opinion, good candidates for Apache Parquet.
>>
>> The goal is to add a solution for both 32-bit and 64-bit FP types. I
>> think it would be beneficial to offer at the very least two distinct
>> paths. The first should offer fast compression and decompression
>> speed with some, but not significant, space savings. The second
>> should offer slower compression and decompression speed but a decent
>> compression ratio. Both lossless. A lossy path will be investigated
>> further and discussed with the community.
>>
>> If I get approval from you – the developers – I can continue with
>> adding support for the new encoding/compression options in the C++
>> implementation of Apache Parquet in Apache Arrow.
>>
>> Please let me know what you think of this idea and whether you have
>> any concerns with the plan.
>>
>> Best regards,
>> Martin Radev
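[For readers following the thread: the byte-stream-split transformation discussed above is simple enough to sketch in a few lines. The following is an illustration of the idea only (Python, little-endian 64-bit doubles), not the parquet-cpp or parquet-mr patch:]

```python
import struct

def split_streams(values):
    """Byte stream split for 64-bit floats: the k-th byte of every value
    goes into the k-th stream, so bytes with similar statistics end up
    adjacent (sign/exponent bytes together, low mantissa bytes together)."""
    raw = b"".join(struct.pack("<d", v) for v in values)
    return [raw[k::8] for k in range(8)]

def merge_streams(streams):
    """Inverse transformation: re-interleave the eight streams."""
    n = len(streams[0])
    raw = bytearray(8 * n)
    for k, stream in enumerate(streams):
        raw[k::8] = stream
    return [struct.unpack_from("<d", raw, 8 * j)[0] for j in range(n)]

# For values in a narrow range, the high-order stream is constant,
# which is exactly the RLE-friendly pattern discussed in the thread.
values = [1.0 + k * 1e-3 for k in range(1000)]
streams = split_streams(values)
assert merge_streams(streams) == values  # lossless round trip
assert len(set(streams[7])) == 1         # sign/exponent byte repeats
```

[In the actual encoding the streams would be concatenated and handed to a general-purpose compressor (zstd in the bs_zstd configuration from the report); the column width (4 bytes for FLOAT, 8 for DOUBLE) determines the number of streams.]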
