Dear all,

thank you for your work on the Apache Parquet format.

We are a group of students at the Technical University of Munich who would like to extend the available compression and encoding options for 32-bit and 64-bit floating-point data in Apache Parquet. The encodings and compression algorithms currently offered in Apache Parquet are heavily specialized towards integer and text data, so there is an opportunity to reduce both I/O throughput and space requirements for floating-point data by selecting a specialized compression algorithm.

I am currently surveying the available literature and publicly available floating-point compressors, and I am writing a report on my findings: the available algorithms, their strengths and weaknesses, compression ratios, compression and decompression speeds, and licenses. Once finished, I will share the report with you and propose which ones I consider good candidates for Apache Parquet.

The goal is to add a solution for both 32-bit and 64-bit floating-point types. I think it would be beneficial to offer at least two distinct paths: the first with fast compression and decompression speed but only modest space savings, the second with slower compression and decompression speed but a decent compression ratio. Both would be lossless. A lossy path will be investigated further and discussed with the community.

If I get approval from you, the developers, I can continue by adding support for the new encoding/compression options in the C++ implementation of Apache Parquet in Apache Arrow.

Please let me know what you think of this idea and whether you have any concerns with the plan.

Best regards,
Martin Radev
