Dear all,

How should we proceed with this proposal?


Would somebody like to offer feedback on the new encoding, the specification
change, and the patches?

How should we start a vote?


I am new to this project and do not have connections in this community. Would a 
senior contributor like to help me drive this?


Regards,

Martin

________________________________
From: Radev, Martin <[email protected]>
Sent: Tuesday, July 23, 2019 8:22:43 PM
To: [email protected]
Cc: Zoltan Ivanfi; [email protected]; [email protected]; 
[email protected]; Karlstetter, Roman; Raoofy, Amir
Subject: [DISCUSS][JAVA][C++] Add a new floating-point Encoding and/or 
Compression algorithm to Parquet Format

Dear Apache Parquet Devs,

I would like to make a proposal for extending the Apache Parquet specification 
by adding a better encoding for FP data which improves compression ratio and 
also to raise the question of adding a lossy compression algorithm for FP data.

Contents:
1. Problem: FP data compression is suboptimal in Apache Parquet
2. Solution idea: a new encoding for FP data to improve compression;
   integration of zfp for lossy compression of FP data
3. Our motivation for making these changes to Parquet
4. Current implementation in parquet-mr, arrow, parquet-format
5. Benchmark - dataset, benchmark project using Avro, results
6. Open questions

1. Problem
Apache Parquet already offers a variety of encodings and compression
algorithms, yet none of them compresses 32-bit or 64-bit FP data well.
There are several reasons for this:
- Sources of FP data, such as sensors, typically add noise to measurements.
  Thus, the least significant mantissa bits often contain noise (illustrated
  below).
- The available encodings in Apache Parquet specialize in string and integer
  data. The IEEE 754 representation of FP data is significantly different.
- The available compressors in Apache Parquet exploit repetitions in the
  input sequence. For floating-point data, each element in the sequence is
  4 or 8 bytes, and since the least significant mantissa bits are often
  noise, long repeated subsequences are very unlikely. Thus, these
  compressors often perform poorly on raw FP data.
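
To make the first point concrete, here is a small illustration in Java (the
values are invented for this example and are not taken from our datasets):

    // Two adjacent "sensor" readings that differ only by noise in the
    // low mantissa bits. IEEE 754 float layout:
    // 1 sign bit | 8 exponent bits | 23 mantissa bits.
    public class Ieee754NoiseDemo {
        public static void main(String[] args) {
            float a = 23.1257f;
            float b = 23.1259f;
            System.out.println(Integer.toBinaryString(Float.floatToIntBits(a)));
            System.out.println(Integer.toBinaryString(Float.floatToIntBits(b)));
            // The sign, exponent and upper mantissa bits agree; only the
            // low mantissa bits differ, so byte-level repetitions across
            // consecutive values are rare.
        }
    }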

2. Solution idea
I have already investigated a variety of ways to compress FP data and shared
the report with the Parquet community.
My investigation covered both lossless and lossy compression.
The original report can be viewed here: 
https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view

For lossless compression, it turns out that a very simple encoding, named "byte 
stream splitting", can produce very good results. Combined with zstd it 
outperformed all FP-specific compressors (fpc, spdp, fpzip, zfp) for the 
majority of the test cases. The encoding creates a stream for each byte of the 
underlying FP type (4 for float, 8 for double) and scatters each byte of the 
value to the corresponding stream. The streams are concatenated and later 
compressed. The new encoding not only offers good results, but is also
simple to implement, has very little overhead, and can even improve
performance in some cases.
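
To make the encoding concrete, here is a minimal sketch in Java for the
float case. This is illustration only, not the parquet-mr patch itself; the
class and method names are made up:

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class ByteStreamSplit {

        // Scatter byte i of every float into stream i, then concatenate
        // the 4 streams. The output has the same size as the input.
        public static byte[] encode(float[] values) {
            byte[] out = new byte[values.length * 4];
            ByteBuffer buf = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
            for (int i = 0; i < values.length; i++) {
                buf.clear();
                buf.putFloat(values[i]);
                for (int stream = 0; stream < 4; stream++) {
                    out[stream * values.length + i] = buf.get(stream);
                }
            }
            return out;
        }

        // Gather one byte from each stream to reassemble each float.
        public static float[] decode(byte[] encoded) {
            int n = encoded.length / 4;
            float[] values = new float[n];
            ByteBuffer buf = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
            for (int i = 0; i < n; i++) {
                buf.clear();
                for (int stream = 0; stream < 4; stream++) {
                    buf.put(encoded[stream * n + i]);
                }
                values[i] = buf.getFloat(0);
            }
            return values;
        }
    }

After the transform, each stream holds bytes with a similar role (for
example, all exponent bytes end up together), which is why general-purpose
compressors such as zstd or gzip find far more repetitions than in the raw
representation.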

For lossy compression, I compared two lossy compressors - ZFP and SZ. SZ 
outperformed ZFP in compression ratio by a reasonable margin but unfortunately 
the project has bugs and the API is not thread-safe.
This makes it unsuitable for Parquet at the moment. ZFP is a more mature
project, which makes it potentially a good fit for integration into Parquet.
We can discuss lossy compression in another thread; here I only want to
point out that we consider it a compelling alternative for some Parquet
users, since the achievable compression ratio is much higher than that of
lossless compression.

Also, please note that this work is not about improving storage efficiency
of the decimal type, but only of floats and doubles.

3. Our motivation
The CAPS chair at the Technical University of Munich uses Apache Parquet for
storing large amounts of FP sensor data. The new encoding improves storage
efficiency - both in required capacity and in time to store.
Beyond our own interest, the improvement also benefits other Parquet users
who store FP data.

4. Status of the implementation
- Pull request adding the new BYTE_STREAM_SPLIT encoding to parquet-format:
  https://github.com/apache/parquet-format/pull/144
- Patch adding BYTE_STREAM_SPLIT to parquet-mr:
  https://github.com/martinradev/parquet-mr/commit/4c0e25581fa4b454535e6dbbfb3ab9932b97350c
- Patch exposing BYTE_STREAM_SPLIT in ParquetWriter (a usage sketch follows
  this list):
  https://github.com/martinradev/parquet-mr/commit/2ec340d5ac8e1d6e598cb83f9b17d75f11f7ff61
  I have not opened a PR for these two patches since we first have to vote
  on the new feature and get the parquet-format pull request in.
- Patch adding BYTE_STREAM_SPLIT to Apache Arrow:
  https://github.com/martinradev/arrow/commit/193c8704c4aab8fdff51f410f0206fa5ed21d801
  Again, no PR since we need to vote on changing the specification first.
- The simple benchmark app I used to collect compression numbers is public:
  https://github.com/martinradev/parquet-mr-streamsplit-bench
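
For readers who want to try the parquet-mr patch, usage could look roughly
like the sketch below. Note that the builder option name
withByteStreamSplitEncoding is my assumption here for illustration, not a
confirmed API of the patch:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class WriteFloats {
        public static void main(String[] args) throws Exception {
            // Minimal Avro schema with a single float column.
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"r\",\"fields\":"
                + "[{\"name\":\"v\",\"type\":\"float\"}]}");
            try (ParquetWriter<GenericRecord> writer =
                     AvroParquetWriter.<GenericRecord>builder(new Path("floats.parquet"))
                         .withSchema(schema)
                         .withCompressionCodec(CompressionCodecName.GZIP)
                         .withByteStreamSplitEncoding(true) // assumed option name
                         .build()) {
                GenericRecord rec = new GenericData.Record(schema);
                rec.put("v", 23.1257f);
                writer.write(rec);
            }
        }
    }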

5. Benchmark
For more info and results, please check my mini benchmark project:
https://github.com/martinradev/parquet-mr-streamsplit-bench
In short, the compression ratio improves by 11% on average for FP32 and by
6% for FP64 when gzip is used as the compression algorithm.
Note that the improvement is higher for many of the large test cases; the
average is dragged down by outliers among some small test cases.
Similar results are to be expected with the other compression algorithms in
Parquet.

6. Open questions
- Would you be happy to add the new BYTE_STREAM_SPLIT encoding to Apache
Parquet?
- What are your thoughts on also adding lossy compression of FP32 and FP64
data to Apache Parquet in the future?
- What are the next steps?

Regards,
Martin
