Hi Martin, I've removed from CC the people who are members of the parquet dev list. I also suggest writing to the dev list only and letting the others subscribe to it if they are interested, or follow the discussion at https://lists.apache.org/[email protected].
Thanks a lot for this summary and all the effort you have already put into this. Personally, I would be happy to see the lossless encoding you've suggested in parquet-format and then in the implementations as well. Regarding the lossy compression, I am not sure. So far we have not done anything like that in Parquet. However, I can see possible benefits in lossy encodings, but let's handle that separately. The next step would be to initiate a vote on this list. See https://www.apache.org/foundation/voting.html for some details about procedural voting, and some further notes at https://community.apache.org/committers/voting.html. You may also find examples of votes in the archives of this mailing list.

Regards,
Gabor

On Thu, Jul 25, 2019 at 12:56 PM Radev, Martin <[email protected]> wrote:
> Dear all,
>
> How should we proceed with this proposal?
>
> Would somebody like to offer feedback on the new encoding, the change to
> the specification, and the patches?
>
> How should we start a vote?
>
> I am new to this project and do not have connections in this community.
> Would a senior contributor like to help me drive this?
>
> Regards,
> Martin
>
> ________________________________
> From: Radev, Martin <[email protected]>
> Sent: Tuesday, July 23, 2019 8:22:43 PM
> To: [email protected]
> Cc: Zoltan Ivanfi; [email protected]; [email protected];
> [email protected]; Karlstetter, Roman; Raoofy, Amir
> Subject: [DISCUSS][JAVA][C++] Add a new floating-point Encoding and/or
> Compression algorithm to Parquet Format
>
> Dear Apache Parquet Devs,
>
> I would like to propose extending the Apache Parquet specification with a
> better encoding for FP data, which improves the compression ratio, and to
> raise the question of adding a lossy compression algorithm for FP data.
>
> Contents:
> 1. Problem: FP data compression is suboptimal in Apache Parquet
> 2. Solution idea: a new encoding for FP data to improve compression;
>    integration of zfp for lossy compression of FP data
> 3. Our motivation for making these changes to Parquet
> 4. Current implementation in parquet-mr, arrow, parquet-format
> 5. Benchmark - dataset, benchmark project using Avro, results
> 6. Open questions
>
> 1. Problem
> Apache Parquet already offers a variety of encodings and compression
> algorithms, yet none of them compresses 32-bit or 64-bit FP data well.
> There are several reasons for this:
> - Sources of FP data such as sensors typically add noise to measurements.
>   Thus, the least significant mantissa bits often contain noise.
> - The available encodings in Apache Parquet specialize in string and
>   integer data. The IEEE 754 representation of FP data is significantly
>   different.
> - The available compressors in Apache Parquet exploit repetitions in the
>   input sequence. For floating-point data, an element in the sequence is
>   either 4 or 8 bytes, and the least significant mantissa bytes are often
>   noise, which makes long repeated subsequences very unlikely.
> Thus, they often cannot perform well on raw FP data.
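> As a small illustration of the point above (a minimal, self-contained Java
> sketch, not part of the patches): for two nearby sensor readings, the
> leading bytes of the IEEE 754 representation agree, while the trailing
> low-mantissa bytes look like noise, so a byte-oriented compressor sees few
> repeated 4-byte patterns in the raw stream.
>
>     import java.nio.ByteBuffer;
>     import java.util.Arrays;
>
>     public class FpNoiseDemo {
>         public static void main(String[] args) {
>             // Two consecutive "sensor readings": close in value, but noisy
>             // in the least significant mantissa bits.
>             float a = 23.4217f;
>             float b = 23.4223f;
>             // Default ByteBuffer order is big-endian: index 0 holds the
>             // sign bit and the high exponent bits.
>             byte[] ba = ByteBuffer.allocate(4).putFloat(a).array();
>             byte[] bb = ByteBuffer.allocate(4).putFloat(b).array();
>             System.out.println("a bytes: " + Arrays.toString(ba));
>             System.out.println("b bytes: " + Arrays.toString(bb));
>             // The sign/exponent bytes of a and b agree; only the trailing
>             // mantissa bytes differ, which is where the noise lives.
>         }
>     }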
> 2. Solution idea
> I have already investigated a variety of ways to compress FP data and
> shared the report with the Parquet community. The investigation covered
> both lossless and lossy compression. The original report can be viewed
> here:
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
>
> For lossless compression, it turns out that a very simple encoding, named
> "byte stream splitting", can produce very good results. Combined with zstd,
> it outperformed all tested FP-specific compressors (fpc, spdp, fpzip, zfp)
> for the majority of the test cases. The encoding creates a stream for each
> byte of the underlying FP type (4 for float, 8 for double) and scatters
> each byte of a value to the corresponding stream. The streams are
> concatenated and later compressed. The new encoding not only produces good
> results; it is also simple to implement, has very little overhead, and can
> even improve performance in some cases.
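> The transform itself is trivial; a minimal sketch in Java (illustrative,
> not the parquet-mr patch) could look like the following. Note that the
> transform does not compress anything by itself; a general-purpose
> compressor such as gzip or zstd is applied to the concatenated streams
> afterwards.
>
>     public class ByteStreamSplit {
>         // Scatter byte s of every value into stream s and concatenate the
>         // streams. numStreams is 4 for float and 8 for double.
>         public static byte[] encode(byte[] raw, int numStreams) {
>             int numValues = raw.length / numStreams;
>             byte[] out = new byte[raw.length];
>             for (int v = 0; v < numValues; v++) {
>                 for (int s = 0; s < numStreams; s++) {
>                     out[s * numValues + v] = raw[v * numStreams + s];
>                 }
>             }
>             return out;
>         }
>
>         // Gather: the exact inverse of encode.
>         public static byte[] decode(byte[] encoded, int numStreams) {
>             int numValues = encoded.length / numStreams;
>             byte[] out = new byte[encoded.length];
>             for (int v = 0; v < numValues; v++) {
>                 for (int s = 0; s < numStreams; s++) {
>                     out[v * numStreams + s] = encoded[s * numValues + v];
>                 }
>             }
>             return out;
>         }
>     }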
> For lossy compression, I compared two lossy compressors, ZFP and SZ. SZ
> outperformed ZFP in compression ratio by a reasonable margin, but
> unfortunately the project has bugs and its API is not thread-safe. This
> makes it unsuitable for Parquet at the moment. ZFP is a more mature
> project, which makes it potentially a good fit for integration into
> Parquet. We can discuss lossy compression in another thread. I only wanted
> to hint that we consider it a great alternative for some Parquet users,
> since the achieved compression ratio is much higher than that of lossless
> compression.
>
> Also, please note that this work is not about improving the storage
> efficiency of the decimal type, but only of floats and doubles.
>
> 3. Our motivation
> The CAPS chair at the Technical University of Munich uses Apache Parquet
> for storing large amounts of FP sensor data. The new encoding improves
> storage efficiency, both in required capacity and in time to store. Beyond
> our own interests, the improvement is also beneficial for other Parquet
> users who store FP data.
>
> 4. Status of the implementation
> - Pull request for adding the new BYTE_STREAM_SPLIT encoding to
>   parquet-format: https://github.com/apache/parquet-format/pull/144
> - Patch for adding BYTE_STREAM_SPLIT to parquet-mr:
>   https://github.com/martinradev/parquet-mr/commit/4c0e25581fa4b454535e6dbbfb3ab9932b97350c
>   Patch for exposing BYTE_STREAM_SPLIT in ParquetWriter (see the usage
>   sketch after this message):
>   https://github.com/martinradev/parquet-mr/commit/2ec340d5ac8e1d6e598cb83f9b17d75f11f7ff61
>   I did not open a PR for these two patches since we have to vote on the
>   new feature first and then get the parquet-format pull request in.
> - Patch for adding BYTE_STREAM_SPLIT to Apache Arrow:
>   https://github.com/martinradev/arrow/commit/193c8704c4aab8fdff51f410f0206fa5ed21d801
>   Again, no PR, since we need to vote on changing the specification first.
> - I made public the simple benchmark app which I used to collect the
>   compression numbers:
>   https://github.com/martinradev/parquet-mr-streamsplit-bench
>
> 5. Benchmark
> For more information and results, please check my mini benchmark project:
> https://github.com/martinradev/parquet-mr-streamsplit-bench
> In short, there is an improvement of 11% on average for FP32 and 6% for
> FP64 when gzip is used as the compression algorithm. Note that the
> improvement is higher for many of the large test cases; the average is
> pulled down by outliers among some small test cases. Similar results are to
> be expected with the other compression algorithms in Parquet.
>
> 6. Open questions
> - Would you be happy to add the new BYTE_STREAM_SPLIT encoding to Apache
>   Parquet?
> - What are your thoughts on the future addition of lossy compression of
>   FP32 and FP64 to Apache Parquet?
> - What are the next steps?
>
> Regards,
> Martin
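For reference, usage with the ParquetWriter patch linked above might look
roughly like the sketch below. The builder method name
withByteStreamSplitEncoding is hypothetical here; the actual name exposed by
the linked patch may differ.

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class WriterExample {
        public static ParquetWriter<GenericRecord> open(Schema schema) throws Exception {
            return AvroParquetWriter
                .<GenericRecord>builder(new Path("fp_data.parquet"))
                .withSchema(schema)
                // BYTE_STREAM_SPLIT is an encoding, not a codec, so it is
                // combined with a regular compression codec such as gzip.
                .withCompressionCodec(CompressionCodecName.GZIP)
                // Hypothetical toggle; the patch may name it differently.
                .withByteStreamSplitEncoding(true)
                .build();
        }
    }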
