Hi Martin, I've removed from CC the people who are members of the parquet dev list. I also suggest writing to the dev list only and letting the others subscribe to it if they are interested, or follow the discussion at https://lists.apache.org/[email protected].
Thanks a lot for this summary and all the effort you have already put into this. Personally, I would be happy to see the lossless encoding you've suggested in parquet-format and then in the implementations as well. Regarding the lossy compression, I am not sure. So far we have not done anything like that in Parquet. However, I can see possible benefits in lossy encodings, but let's handle that separately. The next step would be to initiate a vote on this list. See https://www.apache.org/foundation/voting.html for some details about procedural voting, and some further notes at https://community.apache.org/committers/voting.html. You may also find examples of votes in the archives of this mailing list.

Regards,
Gabor

On Thu, Jul 25, 2019 at 12:56 PM Radev, Martin <[email protected]> wrote:
> Dear all,
>
> How should we proceed with this proposal?
>
> Would somebody like to offer feedback on the new encoding, the change to
> the specification, and the patches?
>
> How should we start a vote?
>
> I am new to this project and do not have connections in this community.
> Would a senior contributor like to help me drive this?
>
> Regards,
> Martin
>
> ________________________________
> From: Radev, Martin <[email protected]>
> Sent: Tuesday, July 23, 2019 8:22:43 PM
> To: [email protected]
> Cc: Zoltan Ivanfi; [email protected]; [email protected];
> [email protected]; Karlstetter, Roman; Raoofy, Amir
> Subject: [DISCUSS][JAVA][C++] Add a new floating-point Encoding and/or
> Compression algorithm to Parquet Format
>
> Dear Apache Parquet Devs,
>
> I would like to propose extending the Apache Parquet specification with a
> better encoding for FP data, which improves the compression ratio, and to
> raise the question of adding a lossy compression algorithm for FP data.
>
> Contents:
> 1. Problem: FP data compression is suboptimal in Apache Parquet
> 2. Solution idea: a new encoding for FP data to improve compression;
>    integration of zfp for lossy compression of FP data
> 3. Our motivation for making these changes to Parquet
> 4. Current implementation in parquet-mr, arrow, parquet-format
> 5. Benchmark - dataset, benchmark project using Avro, results
> 6. Open questions
>
> 1. Problem
> Apache Parquet already offers a variety of encodings and compression
> algorithms, yet none of them compresses 32-bit or 64-bit FP data well.
> There are several reasons for this:
> - Sources of FP data such as sensors typically add noise to measurements.
>   Thus, the least significant mantissa bits often contain noise.
> - The available encodings in Apache Parquet specialize in string and
>   integer data. The IEEE 754 representation of FP data is significantly
>   different.
> - The available compressors in Apache Parquet exploit repetitions in the
>   input sequence. For floating-point data, an element in the sequence is
>   either 4 or 8 bytes, and the least significant mantissa bytes are often
>   noise, which makes long repeated subsequences very unlikely.
> Thus, they often cannot perform well on raw FP data.
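> As a small illustration of the point above (a minimal, self-contained Java
> sketch, not part of the patches): for two nearby sensor readings, the
> leading bytes of the IEEE 754 representation agree, while the trailing
> low-mantissa bytes look like noise, so a byte-oriented compressor sees few
> repeated 4-byte patterns in the raw stream.
>
>     import java.nio.ByteBuffer;
>     import java.util.Arrays;
>
>     public class FpNoiseDemo {
>         public static void main(String[] args) {
>             // Two consecutive "sensor readings": close in value, but noisy
>             // in the least significant mantissa bits.
>             float a = 23.4217f;
>             float b = 23.4223f;
>             // Default ByteBuffer order is big-endian: index 0 holds the
>             // sign bit and the high exponent bits.
>             byte[] ba = ByteBuffer.allocate(4).putFloat(a).array();
>             byte[] bb = ByteBuffer.allocate(4).putFloat(b).array();
>             System.out.println("a bytes: " + Arrays.toString(ba));
>             System.out.println("b bytes: " + Arrays.toString(bb));
>             // The sign/exponent bytes of a and b agree; only the trailing
>             // mantissa bytes differ, which is where the noise lives.
>         }
>     }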
> 2. Solution idea
> I have already investigated a variety of ways to compress FP data and
> shared the report with the Parquet community. The investigation covered
> both lossless and lossy compression. The original report can be viewed
> here:
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
>
> For lossless compression, it turns out that a very simple encoding, named
> "byte stream splitting", can produce very good results. Combined with zstd,
> it outperformed all tested FP-specific compressors (fpc, spdp, fpzip, zfp)
> for the majority of the test cases. The encoding creates a stream for each
> byte of the underlying FP type (4 for float, 8 for double) and scatters
> each byte of a value to the corresponding stream. The streams are
> concatenated and later compressed. The new encoding not only produces good
> results; it is also simple to implement, has very little overhead, and can
> even improve performance in some cases.
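> The transform itself is trivial; a minimal sketch in Java (illustrative,
> not the parquet-mr patch) could look like the following. Note that the
> transform does not compress anything by itself; a general-purpose
> compressor such as gzip or zstd is applied to the concatenated streams
> afterwards.
>
>     public class ByteStreamSplit {
>         // Scatter byte s of every value into stream s and concatenate the
>         // streams. numStreams is 4 for float and 8 for double.
>         public static byte[] encode(byte[] raw, int numStreams) {
>             int numValues = raw.length / numStreams;
>             byte[] out = new byte[raw.length];
>             for (int v = 0; v < numValues; v++) {
>                 for (int s = 0; s < numStreams; s++) {
>                     out[s * numValues + v] = raw[v * numStreams + s];
>                 }
>             }
>             return out;
>         }
>
>         // Gather: the exact inverse of encode.
>         public static byte[] decode(byte[] encoded, int numStreams) {
>             int numValues = encoded.length / numStreams;
>             byte[] out = new byte[encoded.length];
>             for (int v = 0; v < numValues; v++) {
>                 for (int s = 0; s < numStreams; s++) {
>                     out[v * numStreams + s] = encoded[s * numValues + v];
>                 }
>             }
>             return out;
>         }
>     }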
> For lossy compression, I compared two lossy compressors, ZFP and SZ. SZ
> outperformed ZFP in compression ratio by a reasonable margin, but
> unfortunately the project has bugs and its API is not thread-safe. This
> makes it unsuitable for Parquet at the moment. ZFP is a more mature
> project, which makes it potentially a good fit for integration into
> Parquet. We can discuss lossy compression in another thread. I only wanted
> to hint that we consider it a great alternative for some Parquet users,
> since the achieved compression ratio is much higher than that of lossless
> compression.
>
> Also, please note that this work is not about improving the storage
> efficiency of the decimal type, but only of floats and doubles.
>
> 3. Our motivation
> The CAPS chair at the Technical University of Munich uses Apache Parquet
> for storing large amounts of FP sensor data. The new encoding improves
> storage efficiency, both in required capacity and in time to store. Beyond
> our own interests, the improvement is also beneficial for other Parquet
> users who store FP data.
>
> 4. Status of the implementation
> - Pull request for adding the new BYTE_STREAM_SPLIT encoding to
>   parquet-format: https://github.com/apache/parquet-format/pull/144
> - Patch for adding BYTE_STREAM_SPLIT to parquet-mr:
>   https://github.com/martinradev/parquet-mr/commit/4c0e25581fa4b454535e6dbbfb3ab9932b97350c
>   Patch for exposing BYTE_STREAM_SPLIT in ParquetWriter (see the usage
>   sketch after this message):
>   https://github.com/martinradev/parquet-mr/commit/2ec340d5ac8e1d6e598cb83f9b17d75f11f7ff61
>   I did not open a PR for these two patches since we have to vote on the
>   new feature first and then get the parquet-format pull request in.
> - Patch for adding BYTE_STREAM_SPLIT to Apache Arrow:
>   https://github.com/martinradev/arrow/commit/193c8704c4aab8fdff51f410f0206fa5ed21d801
>   Again, no PR, since we need to vote on changing the specification first.
> - I made public the simple benchmark app which I used to collect the
>   compression numbers:
>   https://github.com/martinradev/parquet-mr-streamsplit-bench
>
> 5. Benchmark
> For more information and results, please check my mini benchmark project:
> https://github.com/martinradev/parquet-mr-streamsplit-bench
> In short, there is an improvement of 11% on average for FP32 and 6% for
> FP64 when gzip is used as the compression algorithm. Note that the
> improvement is higher for many of the large test cases; the average is
> pulled down by outliers among some small test cases. Similar results are to
> be expected with the other compression algorithms in Parquet.
>
> 6. Open questions
> - Would you be happy to add the new BYTE_STREAM_SPLIT encoding to Apache
>   Parquet?
> - What are your thoughts on the future addition of lossy compression of
>   FP32 and FP64 to Apache Parquet?
> - What are the next steps?
>
> Regards,
> Martin
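For reference, usage with the ParquetWriter patch linked above might look
roughly like the sketch below. The builder method name
withByteStreamSplitEncoding is hypothetical here; the actual name exposed by
the linked patch may differ.

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class WriterExample {
        public static ParquetWriter<GenericRecord> open(Schema schema) throws Exception {
            return AvroParquetWriter
                .<GenericRecord>builder(new Path("fp_data.parquet"))
                .withSchema(schema)
                // BYTE_STREAM_SPLIT is an encoding, not a codec, so it is
                // combined with a regular compression codec such as gzip.
                .withCompressionCodec(CompressionCodecName.GZIP)
                // Hypothetical toggle; the patch may name it differently.
                .withByteStreamSplitEncoding(true)
                .build();
        }
    }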
