Dear all,

thank you for your work on the Apache Parquet format.

We are a group of students at the Technical University of Munich who would like to extend the available compression and encoding options for 32-bit and 64-bit floating-point data in Apache Parquet. The encodings and compression algorithms currently offered in Apache Parquet are heavily specialized towards integer and text data, so there is an opportunity to reduce both I/O throughput and space requirements for floating-point data by selecting a specialized compression algorithm.

I am currently surveying the available literature and publicly available floating-point compressors, and I am writing a report on my findings: the available algorithms, their strengths and weaknesses, compression ratios, compression and decompression speeds, and licenses. Once finished, I will share the report with you and propose which ones I consider good candidates for Apache Parquet.

The goal is to add a solution for both 32-bit and 64-bit floating-point types. I think it would be beneficial to offer at least two distinct paths: the first with fast compression and decompression speed but only modest space savings, the second with slower compression and decompression speed but a decent compression ratio. Both would be lossless. A lossy path will be investigated further and discussed with the community.

If I get approval from you, the developers, I can continue by adding support for the new encoding/compression options in the C++ implementation of Apache Parquet in Apache Arrow.

Please let me know what you think of this idea and whether you have any concerns with the plan.

Best regards,
Martin Radev
