Hello Martin, I'm willing to run some tests at scale on our genomics data when a parquet-mr pull request for the Java implementation is ready.
Cheers,
Michael

> On Jul 11, 2019, at 1:09 PM, Radev, Martin <[email protected]> wrote:
>
> Dear all,
>
> I created a Jira issue for the new feature and also made a pull request
> for my patch, which extends the format and documentation.
>
> Jira issue: https://issues.apache.org/jira/browse/PARQUET-1622
> Pull request: https://github.com/apache/parquet-format/pull/144
>
> I also have a WIP patch for adding the "BYTE_STREAM_SPLIT" encoding to
> parquet-cpp within Apache Arrow.
>
> How should we proceed?
> It would be great to get feedback from other community members.
>
> Regards,
> Martin
>
> ________________________________
> From: Radev, Martin <[email protected]>
> Sent: Tuesday, July 9, 2019 1:01:25 AM
> To: Zoltan Ivanfi
> Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> Subject: Re: Floating point data compression for Apache Parquet
>
> Hello Zoltan,
>
> I can provide a C++ and Java implementation for the encoder.
> The encoder/decoder is very small, and naturally I have to add tests.
> I expect the biggest hurdle will be setting up the environment and
> reading through the developer guides.
>
> I will write my patches for Apache Arrow and for Apache Parquet and
> send them for review. After getting them in, I can continue with the
> Java implementation. Let me know if you have any concerns.
>
> It would be great to get an opinion from other Parquet contributors : )
>
> Thank you for the feedback!
>
> Best regards,
> Martin
>
> ________________________________
> From: Zoltan Ivanfi <[email protected]>
> Sent: Monday, July 8, 2019 5:06:30 PM
> To: Radev, Martin
> Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> Subject: Re: Floating point data compression for Apache Parquet
>
> Hi Martin,
>
> I agree that bs_zstd would be a good place to start.
> Regarding the choice of language, Java, C++ and Python are your
> options. As far as I know, the Java implementation of Parquet has more
> users from the business sector, where decimal is preferred over
> floating point data types. It is also much more tightly integrated
> with the Hadoop ecosystem (it is even called parquet-mr, as in
> MapReduce), making for a steeper learning curve.
>
> The Python and C++ language bindings have more scientific users, so
> users of these may be more interested in the new encodings. Python is
> a good language for rapid prototyping as well, but the Python binding
> of Parquet may use the C++ library under the hood, I'm not sure (I'm
> more familiar with the Java implementation). In any case, there are at
> least two Python bindings: pyarrow and fastparquet.
>
> I think we can extend the format before the actual implementations are
> ready, provided that the specification is clear and nobody objects to
> adding it to the format. For this, I would wait for the opinion of a
> few more Parquet developers first, since changes to the format that
> are only supported by a single committer usually have a hard time
> getting into the spec. Additionally, could you please clarify which
> language bindings you plan to implement yourself? This will help the
> developers of the different language bindings assess how much work
> they will have to do to add support.
>
> Thanks,
> Zoltan
>
> On Fri, Jul 5, 2019 at 4:34 PM Radev, Martin <[email protected]> wrote:
>
> Hello Zoltan and Parquet devs,
>
> Do you think it would be appropriate to start with a Parquet prototype
> from my side? I suspect that 'bs_zstd' would be the simplest to
> integrate, and from the report we can see an improvement in both ratio
> and speed.
>
> Do you think that Apache Arrow is an appropriate place to prototype
> the extension of the format?
> Do you agree that the enum field 'Encodings' is a suitable place to
> add the 'Byte stream-splitting transformation'? That way it could be
> used with any of the other supported compressors.
>
> It might be best to also add a Java implementation of the
> transformation. Would the project 'parquet-mr' be a good place?
>
> Would the workflow be such that I write my patches, we verify
> correctness, get reviews, merge them, and only then make adjustments
> to the Apache Parquet spec?
>
> Any piece of advice is welcome!
>
> Regards,
> Martin
>
> ________________________________
> From: Zoltan Ivanfi <[email protected]>
> Sent: Friday, July 5, 2019 4:21:39 PM
> To: Radev, Martin
> Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> Subject: Re: Floating point data compression for Apache Parquet
>
> Hi Martin,
>
> Thanks for the explanations, makes sense. Nice work!
>
> Br,
> Zoltan
>
> On Thu, Jul 4, 2019 at 12:22 AM Radev, Martin <[email protected]> wrote:
>
> Hello Zoltan,
>
> >> Is data pre-loaded to RAM before making the measurements?
> Yes, the file is read into physical memory.
>
> For mmap-ed files read from external storage I would expect, but am
> not 100% sure, that the IO overhead would be big enough that all
> algorithms compress at close to the same speed.
>
> >> In "Figure 3: Decompression speed in MB/s", is data size measured
> >> before or after uncompression?
> >> In "Figure 4: Compression speed in MB/s", is data size measured
> >> before or after compression?
> For both, the reported result is "size of the original file / time to
> compress or decompress".
>
> >> According to "Figure 3: Decompression speed in MB/s", decompression
> >> of bs_zstd is almost twice as fast as plain zstd. Do you know what
> >> causes this massive speed improvement?
>
> I do not know all of the details.
> As you mentioned, less data is written out, which could lead to an
> improvement in speed: less data has to be written to memory during
> compression or read from memory during decompression.
>
> Another thing to consider is that ZSTD uses different techniques to
> compress a block of data: "raw", "RLE", "Huffman coding" and "treeless
> coding". I expect that "Huffman coding" is more costly than "RLE", and
> I also expect "RLE" to be applicable to the majority of the sign bits,
> leading to a performance win when the transformation is applied.
>
> I also expect that zstd has to do some form of "optimal parsing" to
> decide how to process the input in order to compress it well. This is
> something every wanna-be-good LZ-like compressor has to do (
> https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html
> ,
> http://cbloomrants.blogspot.com/2011/10/10-24-11-lz-optimal-parse-with-star.html
> ). It might be that the transformed input is somehow easier, which
> leads to faster compression, and that the resulting data is easier to
> decompress, which leads to faster decompression.
>
> I used this as a reference:
> https://www.rfc-editor.org/rfc/pdfrfc/rfc8478.txt.pdf. I am not
> familiar with ZSTD in particular.
>
> I also checked that the majority of the time is spent in zstd.
>
> Example run for msg_sweep3d.dp using zstd at level 1:
> - Transformation during compression: 0.086s; ZSTD compress on
>   transformed data: 0.08s
> - Regular ZSTD compress: 0.34s
> - ZSTD decompress of compressed transformed data: 0.067s;
>   transformation during decompression: 0.021s
> - Regular ZSTD decompress: 0.24s
>
> Example run for msg_sweep3d.dp using zstd at level 20:
>
> - Transformation during compression: 0.083s; ZSTD compress on
>   transformed data: 14.35s
> - Regular ZSTD compress: 183s
> - ZSTD decompress of compressed transformed data: 0.075s;
>   transformation during decompression: 0.022s
> - Regular ZSTD decompress: 0.31s
>
> Here it is clear that the transformed input is easier to parse
> (compress). Maybe the blocks are also of a type which takes less time
> to decompress.
>
> >> If considering using existing libraries to provide any of the
> >> compression algorithms, license compatibility is also an important
> >> factor and therefore would be worth mentioning in Section 5.
> This is something I forgot to list. I will get back to you and the
> other devs with information. The filter I proposed for lossless
> compression can be integrated without any license concerns.
>
> >> Are any of the investigated strategies applicable to DECIMAL values?
> The lossy compressors SZ and ZFP do not support that out of the box. I
> could communicate with the SZ developers to decide how this could be
> added to SZ. An option is to losslessly compress the pre-decimal part
> of the number and lossily compress the post-decimal part.
>
> For lossless compression, we can apply a similar stream-splitting
> technique to decimal types, though it might be somewhat more complex
> and I have not really thought about this case.
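[The level-1 timings quoted above already account for the roughly 2x figure: the transformation itself is cheap, and zstd proper runs about twice as fast on the split streams. A quick back-of-the-envelope check, using only the numbers reported in this message:]

```python
# Speed-up implied by the reported level-1 timings for msg_sweep3d.dp.
# All numbers are copied from the measurements quoted above (seconds).
plain_compress = 0.34
plain_decompress = 0.24

split_compress = 0.086 + 0.08     # transform + zstd on transformed data
split_decompress = 0.067 + 0.021  # zstd + inverse transform

compress_speedup = plain_compress / split_compress
decompress_speedup = plain_decompress / split_decompress

print(f"compress: {compress_speedup:.2f}x, decompress: {decompress_speedup:.2f}x")
# → compress: 2.05x, decompress: 2.73x
```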
>
> Regards,
> Martin
>
> ________________________________
> From: Zoltan Ivanfi <[email protected]>
> Sent: Wednesday, July 3, 2019 6:07:50 PM
> To: Parquet Dev; Radev, Martin
> Cc: Raoofy, Amir; Karlstetter, Roman
> Subject: Re: Floating point data compression for Apache Parquet
>
> Hi Martin,
>
> Thanks for the thorough investigation, very nice report. I would have
> a few questions:
>
> - Is data pre-loaded to RAM before making the measurements?
>
> - In "Figure 3: Decompression speed in MB/s", is data size measured
>   before or after uncompression?
>
> - In "Figure 4: Compression speed in MB/s", is data size measured
>   before or after compression?
>
> - According to "Figure 3: Decompression speed in MB/s", decompression
>   of bs_zstd is almost twice as fast as plain zstd. Do you know what
>   causes this massive speed improvement? Based on the description
>   provided in section 3.2, bs_zstd uses the same zstd compression with
>   an extra step of splitting/combining streams. Since this is extra
>   work, I would have expected bs_zstd to be slower than pure zstd,
>   unless the compressed data becomes so much smaller that it radically
>   improves data access times. However, according to "Figure 2:
>   Compression ratio", bs_zstd achieves "only" 23% better compression
>   than plain zstd, which can not be the reason for the 2x speed-up in
>   itself.
>
> - If considering using existing libraries to provide any of the
>   compression algorithms, license compatibility is also an important
>   factor and therefore would be worth mentioning in Section 5.
>
> - Are any of the investigated strategies applicable to DECIMAL values?
>   Since floating point values and calculations have an inherent
>   inaccuracy, the DECIMAL type is much more important for storing
>   financial data, which is one of the main use cases of Parquet.
>
> Thanks,
> Zoltan
>
> On Mon, Jul 1, 2019 at 10:57 PM Radev, Martin <[email protected]> wrote:
>
> Hello folks,
>
> Thank you for your input. I am finished with my investigation into
> introducing special support for FP compression in Apache Parquet. My
> report also includes an investigation of lossy compressors, though
> there are still some things to be cleared up.
>
> Report: https://drive.google.com/open?id=1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv
>
> Sections 3, 4, 5 and 6 are the most important to go over.
>
> Let me know if you have any questions or concerns.
>
> Regards,
> Martin
>
> ________________________________
> From: Zoltan Ivanfi <[email protected]>
> Sent: Thursday, June 13, 2019 2:16:56 PM
> To: Parquet Dev
> Cc: Raoofy, Amir; Karlstetter, Roman
> Subject: Re: Floating point data compression for Apache Parquet
>
> Hi Martin,
>
> Thanks for your interest in improving Parquet. Efficient encodings are
> really important in a big data file format, so this topic is
> definitely worth researching, and personally I am looking forward to
> your report. Whether to add any new encodings to Parquet, however,
> cannot be answered until we see your findings.
>
> You mention two paths. One has very small computational overhead but
> does not provide significant space savings. The other provides
> significant space savings, but at the price of significant
> computational overhead. While purely based on these properties both
> seem "balanced" (one is small effort, small gain; the other is large
> effort, large gain) and therefore sound like reasonable options, I
> would argue that one should also consider development costs, code
> complexity and compatibility implications when deciding whether a new
> feature is worth implementing.
>
> Adding a new encoding or compression to Parquet complicates the
> specification of the file format and requires implementing it in every
> language binding of the format, which is not only a considerable
> effort, but is also error-prone (see LZ4 for an example, which was
> added to both the Java and the C++ implementation of Parquet, yet the
> two are incompatible with each other). And lack of support is not only
> a minor annoyance in this case: if one is forced to use an older
> reader that does not support the new encoding yet (or a language
> binding that does not support it at all), the data simply cannot be
> read.
>
> In my opinion, no matter how low the computational overhead of a new
> encoding is, if it does not provide significant gains, then the
> specification clutter, implementation costs and the potential for
> compatibility problems greatly outweigh its advantages. For this
> reason, I would say that only encodings that provide significant gains
> are worth adding. As far as I am concerned, such a new encoding would
> be a welcome addition to Parquet.
>
> Thanks,
> Zoltan
>
> On Wed, Jun 12, 2019 at 11:10 PM Radev, Martin <[email protected]> wrote:
>>
>> Dear all,
>>
>> Thank you for your work on the Apache Parquet format.
>>
>> We are a group of students at the Technical University of Munich who
>> would like to extend the available compression and encoding options
>> for 32-bit and 64-bit floating point data in Apache Parquet. The
>> current encodings and compression algorithms offered in Apache
>> Parquet are heavily specialized towards integer and text data. Thus
>> there is an opportunity to reduce both IO throughput requirements and
>> space requirements for handling floating point data by selecting a
>> specialized compression algorithm.
>>
>> Currently, I am doing an investigation of the available literature
>> and publicly available FP compressors.
>> As part of my investigation I am writing a report on my findings:
>> the available algorithms, their strengths and weaknesses, compression
>> ratios, compression and decompression speeds, and licenses. Once
>> finished, I will share the report with you and make a proposal for
>> which ones are, in my opinion, good candidates for Apache Parquet.
>>
>> The goal is to add a solution for both 32-bit and 64-bit FP types. I
>> think it would be beneficial to offer at the very least two distinct
>> paths. The first should offer fast compression and decompression
>> speed with some, but not significant, space savings. The second
>> should offer slower compression and decompression speed but a decent
>> compression ratio. Both lossless. A lossy path will be investigated
>> further and discussed with the community.
>>
>> If I get approval from you – the developers – I can continue with
>> adding support for the new encoding/compression options in the C++
>> implementation of Apache Parquet in Apache Arrow.
>>
>> Please let me know what you think of this idea and whether you have
>> any concerns with the plan.
>>
>> Best regards,
>> Martin Radev
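[For readers following the thread: the byte-stream-split transformation discussed above is simple enough to sketch in a few lines. The following is an illustration of the idea only (Python, little-endian 64-bit doubles), not the parquet-cpp or parquet-mr patch:]

```python
import struct

def split_streams(values):
    """Byte stream split for 64-bit floats: the k-th byte of every value
    goes into the k-th stream, so bytes with similar statistics end up
    adjacent (sign/exponent bytes together, low mantissa bytes together)."""
    raw = b"".join(struct.pack("<d", v) for v in values)
    return [raw[k::8] for k in range(8)]

def merge_streams(streams):
    """Inverse transformation: re-interleave the eight streams."""
    n = len(streams[0])
    raw = bytearray(8 * n)
    for k, stream in enumerate(streams):
        raw[k::8] = stream
    return [struct.unpack_from("<d", raw, 8 * j)[0] for j in range(n)]

# For values in a narrow range, the high-order stream is constant,
# which is exactly the RLE-friendly pattern discussed in the thread.
values = [1.0 + k * 1e-3 for k in range(1000)]
streams = split_streams(values)
assert merge_streams(streams) == values  # lossless round trip
assert len(set(streams[7])) == 1         # sign/exponent byte repeats
```

[In the actual encoding the streams would be concatenated and handed to a general-purpose compressor (zstd in the bs_zstd configuration from the report); the column width (4 bytes for FLOAT, 8 for DOUBLE) determines the number of streams.]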
