Ok, thanks for the clarification. Having a prototype implementation ready (in the form of a PR for arrow, e.g.) to support the argument in the discussion doesn't hurt, I guess.
Just for completeness, regarding the motivation for this work: as Martin wrote, he is a student at the chair for computer architecture and parallel systems (CAPS, https://www.caps.in.tum.de/) at Technical University of Munich. We (the chair) are working on a project storing large amounts of sensor data (floating point) in parquet, and we're trying to improve storage efficiency for this type of data. Having a more efficient encoding for these floats is a first step here, but we have further ideas, like lossy compression. so we definitely should start such a discussion thread. Roman Am Di., 16. Juli 2019 um 15:59 Uhr schrieb Zoltan Ivanfi <[email protected]>: > Hi Wes, > > As far as I'm concerned, the PR was not close to being merged. I > pointed out in an earlier e-mail of this thread that we need to > involve more community members for this change. I approved the PR to > indicate that it looks good to me but was not going to merge it > without consensus. > > Br, > > Zoltan > > On Tue, Jul 16, 2019 at 1:45 PM Wes McKinney <[email protected]> wrote: > > > > I think first you need to create a [DISCUSS] thread whose subject > > clearly indicates that you are proposing to modify the Parquet format. > > You should link to the PR with the changes to parquet-format. Then > > wait for feedback to collect. > > > > Frankly, I was surprised to see a PR close to being merged based > > largely on a 2-way discussion between Martin and Zoltan (unless I > > missed something) > > > > On Tue, Jul 16, 2019 at 2:11 AM Roman Karlstetter > > <[email protected]> wrote: > > > > > > Hi Wes, > > > > > > what would be the formal or informal requirements for such a vote to > pass? > > > What is needed in terms of code and specification before we can start > such > > > a vote? > > > > > > Roman > > > > > > Am Fr., 12. Juli 2019 um 17:07 Uhr schrieb Wes McKinney < > [email protected] > > > >: > > > > > > > I think we need to vote to make any changes to the Parquet format. > New > > > > features carry a heavy responsibility > > > > > > > > On Fri, Jul 12, 2019 at 10:04 AM Michael Heuer <[email protected]> > wrote: > > > > > > > > > > Hello Martin, > > > > > > > > > > I'm willing to run some tests at scale on our genomics data when a > > > > parquet-mr pull request for the Java implementation is ready. > > > > > > > > > > Cheers, > > > > > > > > > > michael > > > > > > > > > > > > > > > > On Jul 11, 2019, at 1:09 PM, Radev, Martin <[email protected]> > > > > wrote: > > > > > > > > > > > > Dear all, > > > > > > > > > > > > > > > > > > I created a Jira issue for the new feature and also made a pull > > > > request for my patch which extends the format and documentation. > > > > > > > > > > > > Jira issue: https://issues.apache.org/jira/browse/PARQUET-1622 < > > > > https://issues.apache.org/jira/browse/PARQUET-1622> > > > > > > Pull request: https://github.com/apache/parquet-format/pull/144 > < > > > > https://github.com/apache/parquet-format/pull/144> > > > > > > > > > > > > > > > > > > I also have a WIP patch for adding the "BYTE_STREAM_SPLIT" > encoding to > > > > parquet-cpp within Apache Arrow. > > > > > > > > > > > > > > > > > > How should we proceed? > > > > > > > > > > > > It would be great to get feedback from other community members. > > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > > > > Martin > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ________________________________ > > > > > > From: Radev, Martin <[email protected] <mailto: > [email protected]>> > > > > > > Sent: Tuesday, July 9, 2019 1:01:25 AM > > > > > > To: Zoltan Ivanfi > > > > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman > > > > > > Subject: Re: Floating point data compression for Apache Parquet > > > > > > > > > > > > Hello Zoltan, > > > > > > > > > > > > > > > > > > I can provide a C++ and Java implementation for the encoder. > > > > > > > > > > > > The encoder/decoder is very small, and naturally I have to add > tests. > > > > > > > > > > > > I expect the biggest hurdle would be setting up the environment > and > > > > reading though the developer guides. > > > > > > > > > > > > > > > > > > I will write my patches for Apache Arrow and for Apache Parquet > and > > > > send them for review. > > > > > > > > > > > > After getting them in, I can continue with the Java > implementation. > > > > > > > > > > > > Let me know if you have any concerns. > > > > > > > > > > > > > > > > > > It would be great to get an opinion from other Parquet > contributors : ) > > > > > > > > > > > > > > > > > > Thank you for the feedback! > > > > > > > > > > > > > > > > > > Best regards, > > > > > > > > > > > > Martin > > > > > > > > > > > > ________________________________ > > > > > > From: Zoltan Ivanfi <[email protected]> > > > > > > Sent: Monday, July 8, 2019 5:06:30 PM > > > > > > To: Radev, Martin > > > > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman > > > > > > Subject: Re: Floating point data compression for Apache Parquet > > > > > > > > > > > > Hi Martin, > > > > > > > > > > > > I agree that bs_zstd would be a good place to start. Regarding > the > > > > choice of language, Java, C++ and Python are your options. As far as > I > > > > know, the Java implementation of Parquet has more users from the > business > > > > sector, where decimal is preferred over floating point data types. > It is > > > > also much more tightly integrated with the Hadoop ecosystem (it is > even > > > > called parquet-mr, as in MapReduce), making for a steeper learning > curve. > > > > > > > > > > > > The Python and C++ language bindings have more scientific users, > so > > > > users of these may be more interested in the new encodings. Python > is a > > > > good language for rapid prototyping as well, but the Python binding > of > > > > Parquet may use the C++ library under the hood, I'm not sure (I'm > more > > > > familiar with the Java implementation). In any case, there are at > least two > > > > Python bindings: pyarrow and fastparquet. > > > > > > > > > > > > I think we can extend the format before the actual > implementations are > > > > ready, provided that the specification is clear and nobody objects to > > > > adding it to the format. For this, I would wait for the opinion of a > few > > > > more Parquet developers first, since changes to the format that are > only > > > > supported by a single committer usually have a hard time getting > into the > > > > spec. Additionally, could you please clarify which language bindings > you > > > > plan to implement yourself? This will help the developers of the > different > > > > language bindings assess how much work they will have to do to add > support. > > > > > > > > > > > > Thanks, > > > > > > > > > > > > Zoltan > > > > > > > > > > > > > > > > > > On Fri, Jul 5, 2019 at 4:34 PM Radev, Martin < > [email protected] > > > > <mailto:[email protected]>> wrote: > > > > > > > > > > > > Hello Zoltan and Parquet devs, > > > > > > > > > > > > > > > > > > do you think it would be appropriate to start with a Parquet > prototype > > > > from my side? > > > > > > > > > > > > I suspect that integrating 'bs_zstd' would be the simplest to > > > > integrate and from the report we can see an improvement in both > ratio and > > > > speed. > > > > > > > > > > > > > > > > > > Do you think that Apache Arrow is an appropriate place to > prototype > > > > the extension of the format? > > > > > > > > > > > > Do you agree that the enum field 'Encodings' is a suitable place > to > > > > add the 'Byte stream-splitting transformation'? In that way it could > be > > > > used with any of the other supported compressors. > > > > > > > > > > > > It might be best to also add a Java implementation of the > > > > transformation. Would the project 'parquet-mr' be a good place? > > > > > > > > > > > > > > > > > > Would the workflow be such that I write my patches, we verify for > > > > correctness, get reviews, merge them AND just then we make > adjustments to > > > > the Apache Parquet spec? > > > > > > > > > > > > > > > > > > Any piece of advice is welcome! > > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > > > > Martin > > > > > > > > > > > > > > > > > > ________________________________ > > > > > > From: Zoltan Ivanfi <[email protected]<mailto:[email protected]>> > > > > > > Sent: Friday, July 5, 2019 4:21:39 PM > > > > > > To: Radev, Martin > > > > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman > > > > > > Subject: Re: Floating point data compression for Apache Parquet > > > > > > > > > > > > > > > > > > Hi Martin, > > > > > > > > > > > > Thanks for the explanations, makes sense. Nice work! > > > > > > > > > > > > Br, > > > > > > > > > > > > Zoltan > > > > > > > > > > > > On Thu, Jul 4, 2019 at 12:22 AM Radev, Martin < > [email protected] > > > > <mailto:[email protected]>> wrote: > > > > > > > > > > > > Hello Zoltan, > > > > > > > > > > > > > > > > > >> Is data pre-loaded to RAM before making the measurements? > > > > > > Yes, the file is read into physical memory. > > > > > > > > > > > > For mmap-ed files, read from external storage, I would expect, > but not > > > > 100% sure, that the IO-overhead would be big enough that all > algorithms > > > > compress quite close at the same speed. > > > > > > > > > > > > > > > > > >> In "Figure 3: Decompression speed in MB/s", is data size > measured > > > > before or after uncompression? > > > > > > > > > > > >> In "Figure 4: Compression speed in MB/s", is data size measured > > > > before or after compression? > > > > > > For both the reported result is "size of the original file / > time to > > > > compress or decompress". > > > > > > > > > > > >> According to "Figure 3: Decompression speed in MB/s", > decompression > > > > of bs_zstd is almost twice as fast as plain zstd. Do you know what > causes > > > > this massive speed improvement? > > > > > > > > > > > > I do not know all of the details. As you mentioned, the written > out > > > > data is less, this could potentially lead to improvement in speed as > less > > > > data has to be written out to memory during compression or read from > memory > > > > during decompression. > > > > > > > > > > > > Another thing to consider is that ZSTD uses different techniques > to > > > > compress a block of data - "raw", "RLE", "Huffman coding", "Treeless > > > > coding". > > > > > > > > > > > > I expect that "Huffman coding" is more costly than "RLE" and I > also > > > > expect that "RLE" to be applicable for the majority of the sign bits > thus > > > > leading to a performance win for when the transformation is applied. > > > > > > > > > > > > > > > > > > I also expect that zstd has to do some form of "optimal parsing" > to > > > > decide how to process the input in order to compress it well. This is > > > > something every wanna-be-good LZ-like compressor has to do ( > > > > > https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html > > > > , > > > > > http://cbloomrants.blogspot.com/2011/10/10-24-11-lz-optimal-parse-with-star.html > > > > ). It might be so that the transformed input is somehow easy which > leads to > > > > faster compression rates and also easier to decompress data which > leads to > > > > faster decompression rates. > > > > > > cbloom rants: 10-24-11 - LZ Optimal Parse with A Star Part 1< > > > > > http://cbloomrants.blogspot.com/2011/10/10-24-11-lz-optimal-parse-with-star.html > > > > < > > > > > http://cbloomrants.blogspot.com/2011/10/10-24-11-lz-optimal-parse-with-star.html > > > > >> > > > > > > cbloomrants.blogspot.com <http://cbloomrants.blogspot.com/> > > > > > > First two notes that aren't about the A-Star parse : 1. All good > LZ > > > > encoders these days that aren't optimal parsers use complicated > heuri... > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I used this as a reference: > > > > https://www.rfc-editor.org/rfc/pdfrfc/rfc8478.txt.pdf. I am not > familiar > > > > with ZSTD in particular. > > > > > > > > > > > > > > > > > > I also checked that the majority of the time is spent in zstd. > > > > > > > > > > > > Example run for msg_sweep3d.dp using zstd at level 1. > > > > > > - Transformation during compression: 0.086s, ZSTD compress on > > > > transformed data: 0.08s > > > > > > > > > > > > - regular ZSTD: 0.34s > > > > > > - ZSTD decompress from compressed transformed data: 0.067s, > > > > Transformation during decompression: 0.021s > > > > > > - regular ZSTD decompress: 0.24s > > > > > > > > > > > > > > > > > > Example run for msg_sweep3d.dp using zstd at level 20. > > > > > > > > > > > > - Transformation during compression: 0.083s, ZSTD compress on > > > > transformed data: 14.35s > > > > > > > > > > > > - regular ZSTD: 183s > > > > > > - ZSTD decompress from compressed transformed data: 0.075s, > > > > Transformation during decompression: 0.022s > > > > > > - regular ZSTD decompress: 0.31s > > > > > > Here it's clear that the transformed input is easier to parse > > > > (compress). Maybe also the blocks are of type which takes less time > to > > > > decompress. > > > > > > > > > > > >> If considering using existing libraries to provide any of the > > > > compression algorithms, license compatibility is also an important > factor > > > > and therefore would be worth mentioning in Section 5. > > > > > > This is something I forgot to list. I will back to you and the > other > > > > devs with information. > > > > > > > > > > > > The filter I proposed for lossless compression can be integrated > > > > without any concerns for a license. > > > > > > > > > > > > > > > > > >> Are any of the investigated strategies applicable to DECIMAL > values? > > > > > > The lossy compressors SZ and ZFP do not support that outside of > the > > > > box. I could communicate with the SZ developers to come to a > decision how > > > > this can be added to SZ. An option is to losslessly compress the > > > > pre-decimal number and lossyly compress the post-decimal number. > > > > > > > > > > > > For lossless compression, we can apply a similar stream splitting > > > > technique for decimal types though it might be somewhat more complex > and I > > > > have not really though about this case. > > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > > > > Martin > > > > > > > > > > > > ________________________________ > > > > > > From: Zoltan Ivanfi <[email protected]<mailto:[email protected]>> > > > > > > Sent: Wednesday, July 3, 2019 6:07:50 PM > > > > > > To: Parquet Dev; Radev, Martin > > > > > > Cc: Raoofy, Amir; Karlstetter, Roman > > > > > > Subject: Re: Floating point data compression for Apache Parquet > > > > > > > > > > > > Hi Martin, > > > > > > > > > > > > Thanks for the thorough investigation, very nice report. I would > have > > > > a few questions: > > > > > > > > > > > > - Is data pre-loaded to RAM before making the measurements? > > > > > > > > > > > > - In "Figure 3: Decompression speed in MB/s", is data size > measured > > > > before or after uncompression? > > > > > > > > > > > > - In "Figure 4: Compression speed in MB/s", is data size measured > > > > before or after compression? > > > > > > > > > > > > - According to "Figure 3: Decompression speed in MB/s", > decompression > > > > of bs_zstd is almost twice as fast as plain zstd. Do you know what > causes > > > > this massive speed improvement? Based on the description provided in > > > > section 3.2, bs_zstd uses the same zstd compression with an extra > step of > > > > splitting/combining streams. Since this is extra work, I would have > > > > expected bs_zstd to be slower than pure zstd, unless the compressed > data > > > > becomes so much smaller that it radically improves data access times. > > > > However, according to "Figure 2: Compression ratio", bs_zstd achieves > > > > "only" 23% better compression than plain zstd, which can not be the > reason > > > > for the 2x speed-up in itself. > > > > > > > > > > > > - If considering using existing libraries to provide any of the > > > > compression algorithms, license compatibility is also an important > factor > > > > and therefore would be worth mentioning in Section 5. > > > > > > > > > > > > - Are any of the investigated strategies applicable to DECIMAL > values? > > > > Since floating point values and calculations have an inherent > inaccuracy, > > > > the DECIMAL type is much more important for storing financial data, > which > > > > is one of the main use cases of Parquet. > > > > > > > > > > > > Thanks, > > > > > > > > > > > > Zoltan > > > > > > > > > > > > On Mon, Jul 1, 2019 at 10:57 PM Radev, Martin < > [email protected] > > > > <mailto:[email protected]>> wrote: > > > > > > Hello folks, > > > > > > > > > > > > > > > > > > thank you for your input. > > > > > > > > > > > > > > > > > > I am finished with my investigation regarding introducing special > > > > support for FP compression in Apache Parquet. > > > > > > > > > > > > My report also includes an investigation of lossy compressors > though > > > > there are still some things to be cleared out. > > > > > > > > > > > > > > > > > > Report: > > > > https://drive.google.com/open?id=1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv > > > > > > > > > > > > > > > > > > Sections 3 4 5 6 are the most important to go over. > > > > > > > > > > > > > > > > > > Let me know if you have any questions or concerns. > > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > > > > Martin > > > > > > > > > > > > ________________________________ > > > > > > From: Zoltan Ivanfi <[email protected]> > > > > > > Sent: Thursday, June 13, 2019 2:16:56 PM > > > > > > To: Parquet Dev > > > > > > Cc: Raoofy, Amir; Karlstetter, Roman > > > > > > Subject: Re: Floating point data compression for Apache Parquet > > > > > > > > > > > > Hi Martin, > > > > > > > > > > > > Thanks for your interest in improving Parquet. Efficient > encodings are > > > > > > really important in a big data file format, so this topic is > > > > > > definitely worth researching and personally I am looking forward > to > > > > > > your report. Whether to add any new encodings to Parquet, > however, can > > > > > > not be answered until we see the results of your findings. > > > > > > > > > > > > You mention two paths. One has very small computational overhead > but > > > > > > does not provide significant space savings. The other provides > > > > > > significant space savings but at the price of a significant > > > > > > computational overhead. While purely based on these properties > both of > > > > > > them seem "balanced" (one is small effort, small gain; the other > is > > > > > > large effort, large gain) and therefore sound reasonable > options, I > > > > > > would argue that one should also consider development costs, code > > > > > > complexity and compatibility implications when deciding about > whether > > > > > > a new feature is worth implementing. > > > > > > > > > > > > Adding a new encoding or compression to Parquet complicates the > > > > > > specification of the file format and requires implementing it in > every > > > > > > language binding of the format, which is not only a considerable > > > > > > effort, but is also error-prone (see LZ4 for an example, which > was > > > > > > added to both the Java and the C++ implementation of Parquet, > yet they > > > > > > are incompatible with each other). And lack of support is not > only a > > > > > > minor annoyance in this case: if one is forced to use an older > reader > > > > > > that does not support the new encoding yet (or a language > binding that > > > > > > does not support it at all), the data simply can not be read. > > > > > > > > > > > > In my opinion, no matter how low the computational overhead of a > new > > > > > > encoding is, if it does not provide significant gains, then the > > > > > > specification clutter, implementation costs and the potential of > > > > > > compatibility problems greatly outweigh its advantages. For this > > > > > > reason, I would say that only encodings that provide significant > gains > > > > > > are worth adding. As far as I am concerned, such a new encoding > would > > > > > > be a welcome addition to Parquet. > > > > > > > > > > > > Thanks, > > > > > > > > > > > > Zoltan > > > > > > > > > > > > On Wed, Jun 12, 2019 at 11:10 PM Radev, Martin < > [email protected] > > > > <mailto:[email protected]>> wrote: > > > > > >> > > > > > >> Dear all, > > > > > >> > > > > > >> thank you for your work on the Apache Parquet format. > > > > > >> > > > > > >> We are a group of students at the Technical University of > Munich who > > > > would like to extend the available compression and encoding options > for > > > > 32-bit and 64-bit floating point data in Apache Parquet. > > > > > >> The current encodings and compression algorithms offered in > Apache > > > > Parquet are heavily specialized towards integer and text data. > > > > > >> Thus there is an opportunity in reducing both io throughput > > > > requirements and space requirements for handling floating point data > by > > > > selecting a specialized compression algorithm. > > > > > >> > > > > > >> Currently, I am doing an investigation on the available > literature > > > > and publicly available fp compressors. In my investigation I am > writing a > > > > report on my findings - the available algorithms, their strengths and > > > > weaknesses, compression rates, compression speeds and decompression > speeds, > > > > and licenses. Once finished I will share the report with you and > make a > > > > proposal which ones IMO are good candidates for Apache Parquet. > > > > > >> > > > > > >> The goal is to add a solution for both 32-bit and 64-bit fp > types. I > > > > think that it would be beneficial to offer at the very least two > distinct > > > > paths. The first one should offer fast compression and decompression > speed > > > > with some but not significant saving in space. The second one should > offer > > > > slower compression and decompression speed but with a decent > compression > > > > rate. Both lossless. A lossy path will be investigated further and > > > > discussed with the community. > > > > > >> > > > > > >> If I get an approval from you – the developers – I can continue > with > > > > adding support for the new encoding/compression options in the C++ > > > > implementation of Apache Parquet in Apache Arrow. > > > > > >> > > > > > >> Please let me know what you think of this idea and whether you > have > > > > any concerns with the plan. > > > > > >> > > > > > >> Best regards, > > > > > >> Martin Radev > > > > > > > > > >
