Re: Floating point data compression for Apache Parquet

Wes McKinney Tue, 16 Jul 2019 04:45:27 -0700

I think first you need to create a [DISCUSS] thread whose subject
clearly indicates that you are proposing to modify the Parquet format.
You should link to the PR with the changes to parquet-format. Then
wait for feedback to collect.


Frankly, I was surprised to see a PR close to being merged based
largely on a 2-way discussion between Martin and Zoltan (unless I
missed something).

On Tue, Jul 16, 2019 at 2:11 AM Roman Karlstetter
<[email protected]> wrote:
>
> Hi Wes,
>
> what would be the formal or informal requirements for such a vote to pass?
> What is needed in terms of code and specification before we can start such
> a vote?
>
> Roman
>
> On Fri, Jul 12, 2019 at 5:07 PM Wes McKinney <[email protected]> wrote:
>
> > I think we need to vote to make any changes to the Parquet format. New
> > features carry a heavy responsibility
> >
> > On Fri, Jul 12, 2019 at 10:04 AM Michael Heuer <[email protected]> wrote:
> > >
> > > Hello Martin,
> > >
> > > I'm willing to run some tests at scale on our genomics data when a
> > parquet-mr pull request for the Java implementation is ready.
> > >
> > > Cheers,
> > >
> > >    michael
> > >
> > >
> > > > On Jul 11, 2019, at 1:09 PM, Radev, Martin <[email protected]>
> > wrote:
> > > >
> > > > Dear all,
> > > >
> > > >
> > > > I created a Jira issue for the new feature and also made a pull
> > request for my patch which extends the format and documentation.
> > > >
> > > > > Jira issue: https://issues.apache.org/jira/browse/PARQUET-1622
> > > > > Pull request: https://github.com/apache/parquet-format/pull/144
> > > >
> > > >
> > > > I also have a WIP patch for adding the "BYTE_STREAM_SPLIT" encoding to
> > parquet-cpp within Apache Arrow.
> > > >
> > > >
> > > > How should we proceed?
> > > >
> > > > It would be great to get feedback from other community members.
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Martin
> > > >
> > > >
> > > >
> > > >
> > > > ________________________________
> > > > > From: Radev, Martin <[email protected]>
> > > > Sent: Tuesday, July 9, 2019 1:01:25 AM
> > > > To: Zoltan Ivanfi
> > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > > Subject: Re: Floating point data compression for Apache Parquet
> > > >
> > > > Hello Zoltan,
> > > >
> > > >
> > > > I can provide a C++ and Java implementation for the encoder.
> > > >
> > > > The encoder/decoder is very small, and naturally I have to add tests.
> > > >
> > > > > I expect the biggest hurdle would be setting up the environment and
> > > > > reading through the developer guides.
> > > >
> > > >
> > > > I will write my patches for Apache Arrow and for Apache Parquet and
> > send them for review.
> > > >
> > > > After getting them in, I can continue with the Java implementation.
> > > >
> > > > Let me know if you have any concerns.
> > > >
> > > >
> > > > It would be great to get an opinion from other Parquet contributors : )
> > > >
> > > >
> > > > Thank you for the feedback!
> > > >
> > > >
> > > > Best regards,
> > > >
> > > > Martin
> > > >
> > > > ________________________________
> > > > From: Zoltan Ivanfi <[email protected]>
> > > > Sent: Monday, July 8, 2019 5:06:30 PM
> > > > To: Radev, Martin
> > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > > Subject: Re: Floating point data compression for Apache Parquet
> > > >
> > > > Hi Martin,
> > > >
> > > > I agree that bs_zstd would be a good place to start. Regarding the
> > choice of language, Java, C++ and Python are your options. As far as I
> > know, the Java implementation of Parquet has more users from the business
> > sector, where decimal is preferred over floating point data types. It is
> > also much more tightly integrated with the Hadoop ecosystem (it is even
> > called parquet-mr, as in MapReduce), making for a steeper learning curve.
> > > >
> > > > The Python and C++ language bindings have more scientific users, so
> > users of these may be more interested in the new encodings. Python is a
> > good language for rapid prototyping as well, but the Python binding of
> > Parquet may use the C++ library under the hood, I'm not sure (I'm more
> > familiar with the Java implementation). In any case, there are at least two
> > Python bindings: pyarrow and fastparquet.
> > > >
> > > > I think we can extend the format before the actual implementations are
> > ready, provided that the specification is clear and nobody objects to
> > adding it to the format. For this, I would wait for the opinion of a few
> > more Parquet developers first, since changes to the format that are only
> > supported by a single committer usually have a hard time getting into the
> > spec. Additionally, could you please clarify which language bindings you
> > plan to implement yourself? This will help the developers of the different
> > language bindings assess how much work they will have to do to add support.
> > > >
> > > > Thanks,
> > > >
> > > > Zoltan
> > > >
> > > >
> > > > > On Fri, Jul 5, 2019 at 4:34 PM Radev, Martin <[email protected]> wrote:
> > > >
> > > > Hello Zoltan and Parquet devs,
> > > >
> > > >
> > > > do you think it would be appropriate to start with a Parquet prototype
> > from my side?
> > > >
> > > > > I suspect that 'bs_zstd' would be the simplest to integrate, and from
> > > > > the report we can see an improvement in both ratio and speed.
> > > >
> > > >
> > > > Do you think that Apache Arrow is an appropriate place to prototype
> > the extension of the format?
> > > >
> > > > Do you agree that the enum field 'Encodings' is a suitable place to
> > add the 'Byte stream-splitting transformation'? In that way it could be
> > used with any of the other supported compressors.
> > > >
> > > > It might be best to also add a Java implementation of the
> > transformation. Would the project 'parquet-mr' be a good place?
> > > >
> > > >
> > > > > Would the workflow be such that I write my patches, we verify them for
> > > > > correctness, get reviews, merge them, and only then make adjustments to
> > > > > the Apache Parquet spec?
> > > >
> > > >
> > > > Any piece of advice is welcome!
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Martin
> > > >
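For readers following the thread, the byte stream-splitting transformation discussed in the message above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code from the pull request: the function names `byte_stream_split` and `byte_stream_merge` are hypothetical, the layout (byte k of every value goes to stream k, and the streams are concatenated) is my reading of the proposal, and `zlib` from the Python standard library stands in for zstd purely so the example is self-contained.

```python
import struct
import zlib


def byte_stream_split(values, fmt="<f"):
    """Scatter byte k of every value into stream k, then concatenate the streams.

    `fmt` is a struct format for one value ("<f" = 32-bit, "<d" = 64-bit float).
    """
    width = struct.calcsize(fmt)
    raw = b"".join(struct.pack(fmt, v) for v in values)
    # raw[k::width] picks byte k of each packed value.
    return b"".join(raw[k::width] for k in range(width))


def byte_stream_merge(data, fmt="<f"):
    """Inverse of byte_stream_split: interleave the streams back into values."""
    width = struct.calcsize(fmt)
    count = len(data) // width
    raw = bytearray(len(data))
    for k in range(width):
        raw[k::width] = data[k * count:(k + 1) * count]
    return [struct.unpack_from(fmt, raw, i * width)[0] for i in range(count)]


if __name__ == "__main__":
    # Hypothetical sample data; real columns will behave differently.
    values = [1.0 + i / 1024 for i in range(4096)]
    plain = b"".join(struct.pack("<d", v) for v in values)
    split = byte_stream_split(values, "<d")
    assert byte_stream_merge(split, "<d") == values
    # zlib is a stand-in compressor here (assumption), not zstd.
    print("plain:", len(zlib.compress(plain)), "split:", len(zlib.compress(split)))
```

Because the transformation only reorders bytes, it is lossless and independent of the compressor applied afterwards, which is why it could be expressed as an encoding and combined with any of the supported compression codecs.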
> > > >
> > > > ________________________________
> > > > > From: Zoltan Ivanfi <[email protected]>
> > > > Sent: Friday, July 5, 2019 4:21:39 PM
> > > > To: Radev, Martin
> > > > Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> > > > Subject: Re: Floating point data compression for Apache Parquet
> > > >
> > > >
> > > > Hi Martin,
> > > >
> > > > Thanks for the explanations, makes sense. Nice work!
> > > >
> > > > Br,
> > > >
> > > > Zoltan
> > > >
> > > > > On Thu, Jul 4, 2019 at 12:22 AM Radev, Martin <[email protected]> wrote:
> > > >
> > > > Hello Zoltan,
> > > >
> > > >
> > > >> Is data pre-loaded to RAM before making the measurements?
> > > > Yes, the file is read into physical memory.
> > > >
> > > > > For mmap-ed files read from external storage, I would expect, though I
> > > > > am not 100% sure, that the I/O overhead would be large enough that all
> > > > > algorithms compress at close to the same speed.
> > > >
> > > >
> > > >> In "Figure 3: Decompression speed in MB/s", is data size measured
> > before or after uncompression?
> > > >
> > > >> In "Figure 4: Compression speed in MB/s", is data size measured
> > before or after compression?
> > > > For both the reported result is "size of the original file / time to
> > compress or decompress".
> > > >
> > > >> According to "Figure 3: Decompression speed in MB/s", decompression
> > of bs_zstd is almost twice as fast as plain zstd. Do you know what causes
> > this massive speed improvement?
> > > >
> > > > > I do not know all of the details. As you mentioned, less data is
> > > > > written out, which could improve speed because less data has to be
> > > > > written to memory during compression or read from memory during
> > > > > decompression.
> > > >
> > > > Another thing to consider is that ZSTD uses different techniques to
> > compress a block of data - "raw", "RLE", "Huffman coding", "Treeless
> > coding".
> > > >
> > > > > I expect that "Huffman coding" is more costly than "RLE", and I also
> > > > > expect "RLE" to be applicable to the majority of the sign bits, leading
> > > > > to a performance win when the transformation is applied.
> > > >
> > > >
> > > > > I also expect that zstd has to do some form of "optimal parsing" to
> > > > > decide how to process the input in order to compress it well. This is
> > > > > something every wanna-be-good LZ-like compressor has to do (
> > > > > https://martinradev.github.io/jekyll/update/2019/05/29/writing-a-pe32-x86-exe-packer.html ,
> > > > > http://cbloomrants.blogspot.com/2011/10/10-24-11-lz-optimal-parse-with-star.html
> > > > > ). It may be that the transformed input is easier to parse, which leads
> > > > > to faster compression, and that the compressed output is also easier to
> > > > > decode, which leads to faster decompression.
> > > >
> > > >
> > > >
> > > >
> > > > I used this as a reference:
> > https://www.rfc-editor.org/rfc/pdfrfc/rfc8478.txt.pdf. I am not familiar
> > with ZSTD in particular.
> > > >
> > > >
> > > > I also checked that the majority of the time is spent in zstd.
> > > >
> > > > Example run for msg_sweep3d.dp using zstd at level 1.
> > > > - Transformation during compression: 0.086s, ZSTD compress on
> > transformed data: 0.08s
> > > >
> > > > - regular ZSTD: 0.34s
> > > > - ZSTD decompress from compressed transformed data: 0.067s,
> > Transformation during decompression: 0.021s
> > > > - regular ZSTD decompress: 0.24s
> > > >
> > > >
> > > > Example run for msg_sweep3d.dp using zstd at level 20.
> > > >
> > > > - Transformation during compression: 0.083s, ZSTD compress on
> > transformed data: 14.35s
> > > >
> > > > - regular ZSTD: 183s
> > > > - ZSTD decompress from compressed transformed data: 0.075s,
> > Transformation during decompression: 0.022s
> > > > - regular ZSTD decompress: 0.31s
> > > > > Here it's clear that the transformed input is easier to parse
> > > > > (compress). The blocks may also be of a type that takes less time to
> > > > > decompress.
> > > >
> > > >> If considering using existing libraries to provide any of the
> > compression algorithms, license compatibility is also an important factor
> > and therefore would be worth mentioning in Section 5.
> > > > > This is something I forgot to list. I will get back to you and the
> > > > > other devs with information.
> > > >
> > > > The filter I proposed for lossless compression can be integrated
> > without any concerns for a license.
> > > >
> > > >
> > > >> Are any of the investigated strategies applicable to DECIMAL values?
> > > > > The lossy compressors SZ and ZFP do not support that out of the box. I
> > > > > could communicate with the SZ developers to decide how this can be
> > > > > added to SZ. An option is to losslessly compress the part before the
> > > > > decimal point and lossily compress the part after it.
> > > >
> > > > > For lossless compression, we can apply a similar stream splitting
> > > > > technique for decimal types, though it might be somewhat more complex
> > > > > and I have not really thought about this case.
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Martin
> > > >
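The msg_sweep3d.dp timings quoted in the message above can be turned into end-to-end speedups with a little arithmetic. Since both variants are measured as "size of the original file / time", the ratio of throughputs equals the inverse ratio of total times. The sketch below sums the transformation and zstd steps for the split variant and divides; the dictionary keys and the `speedup` helper are names chosen for illustration, and it assumes that the line "regular ZSTD" reports plain zstd compression time.

```python
# Timings in seconds, copied from the msg_sweep3d.dp runs in the message above.
# Assumption: the line "regular ZSTD" is the plain zstd compression time.
RUNS = {
    1: {
        "split_compress": (0.086, 0.08),     # transform, zstd compress
        "plain_compress": 0.34,
        "split_decompress": (0.067, 0.021),  # zstd decompress, inverse transform
        "plain_decompress": 0.24,
    },
    20: {
        "split_compress": (0.083, 14.35),
        "plain_compress": 183.0,
        "split_decompress": (0.075, 0.022),
        "plain_decompress": 0.31,
    },
}


def speedup(plain_seconds, split_parts):
    """Ratio of plain zstd time to the summed transform + zstd time."""
    return plain_seconds / sum(split_parts)


for level, t in RUNS.items():
    c = speedup(t["plain_compress"], t["split_compress"])
    d = speedup(t["plain_decompress"], t["split_decompress"])
    print(f"zstd level {level}: compress {c:.2f}x, decompress {d:.2f}x")
```

On these numbers the split variant is roughly 2x faster to compress and 2.7x faster to decompress at level 1, and roughly 12.7x and 3.2x faster at level 20, with the transformation itself a small part of the total except for level 1 compression.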
> > > > ________________________________
> > > > > From: Zoltan Ivanfi <[email protected]>
> > > > Sent: Wednesday, July 3, 2019 6:07:50 PM
> > > > To: Parquet Dev; Radev, Martin
> > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > Subject: Re: Floating point data compression for Apache Parquet
> > > >
> > > > Hi Martin,
> > > >
> > > > Thanks for the thorough investigation, very nice report. I would have
> > a few questions:
> > > >
> > > > - Is data pre-loaded to RAM before making the measurements?
> > > >
> > > > - In "Figure 3: Decompression speed in MB/s", is data size measured
> > before or after uncompression?
> > > >
> > > > - In "Figure 4: Compression speed in MB/s", is data size measured
> > before or after compression?
> > > >
> > > > - According to "Figure 3: Decompression speed in MB/s", decompression
> > of bs_zstd is almost twice as fast as plain zstd. Do you know what causes
> > this massive speed improvement? Based on the description provided in
> > section 3.2, bs_zstd uses the same zstd compression with an extra step of
> > splitting/combining streams. Since this is extra work, I would have
> > expected bs_zstd to be slower than pure zstd, unless the compressed data
> > becomes so much smaller that it radically improves data access times.
> > However, according to "Figure 2: Compression ratio", bs_zstd achieves
> > "only" 23% better compression than plain zstd, which can not be the reason
> > for the 2x speed-up in itself.
> > > >
> > > > - If considering using existing libraries to provide any of the
> > compression algorithms, license compatibility is also an important factor
> > and therefore would be worth mentioning in Section 5.
> > > >
> > > > - Are any of the investigated strategies applicable to DECIMAL values?
> > Since floating point values and calculations have an inherent inaccuracy,
> > the DECIMAL type is much more important for storing financial data, which
> > is one of the main use cases of Parquet.
> > > >
> > > > Thanks,
> > > >
> > > > Zoltan
> > > >
> > > > > On Mon, Jul 1, 2019 at 10:57 PM Radev, Martin <[email protected]> wrote:
> > > > Hello folks,
> > > >
> > > >
> > > > thank you for your input.
> > > >
> > > >
> > > > I am finished with my investigation regarding introducing special
> > support for FP compression in Apache Parquet.
> > > >
> > > > > My report also includes an investigation of lossy compressors, though
> > > > > there are still some things to be cleared up.
> > > >
> > > >
> > > > Report:
> > https://drive.google.com/open?id=1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv
> > > >
> > > >
> > > > > Sections 3, 4, 5 and 6 are the most important to go over.
> > > >
> > > >
> > > > Let me know if you have any questions or concerns.
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Martin
> > > >
> > > > ________________________________
> > > > From: Zoltan Ivanfi <[email protected]>
> > > > Sent: Thursday, June 13, 2019 2:16:56 PM
> > > > To: Parquet Dev
> > > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > > Subject: Re: Floating point data compression for Apache Parquet
> > > >
> > > > Hi Martin,
> > > >
> > > > Thanks for your interest in improving Parquet. Efficient encodings are
> > > > really important in a big data file format, so this topic is
> > > > definitely worth researching and personally I am looking forward to
> > > > your report. Whether to add any new encodings to Parquet, however, can
> > > > not be answered until we see the results of your findings.
> > > >
> > > > You mention two paths. One has very small computational overhead but
> > > > does not provide significant space savings. The other provides
> > > > significant space savings but at the price of a significant
> > > > computational overhead. While purely based on these properties both of
> > > > them seem "balanced" (one is small effort, small gain; the other is
> > > > large effort, large gain) and therefore sound reasonable options, I
> > > > would argue that one should also consider development costs, code
> > > > complexity and compatibility implications when deciding about whether
> > > > a new feature is worth implementing.
> > > >
> > > > Adding a new encoding or compression to Parquet complicates the
> > > > specification of the file format and requires implementing it in every
> > > > language binding of the format, which is not only a considerable
> > > > effort, but is also error-prone (see LZ4 for an example, which was
> > > > added to both the Java and the C++ implementation of Parquet, yet they
> > > > are incompatible with each other). And lack of support is not only a
> > > > minor annoyance in this case: if one is forced to use an older reader
> > > > that does not support the new encoding yet (or a language binding that
> > > > does not support it at all), the data simply can not be read.
> > > >
> > > > In my opinion, no matter how low the computational overhead of a new
> > > > encoding is, if it does not provide significant gains, then the
> > > > specification clutter, implementation costs and the potential of
> > > > compatibility problems greatly outweigh its advantages. For this
> > > > reason, I would say that only encodings that provide significant gains
> > > > are worth adding. As far as I am concerned, such a new encoding would
> > > > be a welcome addition to Parquet.
> > > >
> > > > Thanks,
> > > >
> > > > Zoltan
> > > >
> > > > > On Wed, Jun 12, 2019 at 11:10 PM Radev, Martin <[email protected]> wrote:
> > > >>
> > > >> Dear all,
> > > >>
> > > >> thank you for your work on the Apache Parquet format.
> > > >>
> > > >> We are a group of students at the Technical University of Munich who
> > would like to extend the available compression and encoding options for
> > 32-bit and 64-bit floating point data in Apache Parquet.
> > > >> The current encodings and compression algorithms offered in Apache
> > Parquet are heavily specialized towards integer and text data.
> > > > >> Thus there is an opportunity to reduce both I/O throughput
> > > > >> requirements and space requirements for handling floating point data
> > > > >> by selecting a specialized compression algorithm.
> > > >>
> > > > >> Currently, I am investigating the available literature and publicly
> > > > >> available FP compressors, and I am writing a report on my findings:
> > > > >> the available algorithms, their strengths and weaknesses, compression
> > > > >> rates, compression speeds and decompression speeds, and licenses.
> > > > >> Once finished, I will share the report with you and propose which
> > > > >> ones are, in my opinion, good candidates for Apache Parquet.
> > > >>
> > > >> The goal is to add a solution for both 32-bit and 64-bit fp types. I
> > think that it would be beneficial to offer at the very least two distinct
> > paths. The first one should offer fast compression and decompression speed
> > with some but not significant saving in space. The second one should offer
> > slower compression and decompression speed but with a decent compression
> > rate. Both lossless. A lossy path will be investigated further and
> > discussed with the community.
> > > >>
> > > >> If I get an approval from you – the developers – I can continue with
> > adding support for the new encoding/compression options in the C++
> > implementation of Apache Parquet in Apache Arrow.
> > > >>
> > > >> Please let me know what you think of this idea and whether you have
> > any concerns with the plan.
> > > >>
> > > >> Best regards,
> > > >> Martin Radev
> > >
> >
