Hello Ryan,
we decided that it would be beneficial to try out your proposal. I will look into it and provide measurements on the compression ratio and speed. Regards, Martin ________________________________ From: Ryan Blue <rb...@netflix.com> Sent: Saturday, September 14, 2019 2:23:20 AM To: Radev, Martin Cc: Parquet Dev; Raoofy, Amir; Karlstetter, Roman Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet > Using RLE for the sign, exponents and the top-most mantissa bytes can help > when data is repetitive and make it worse for other. I agree. But we use RLE in similar cases because we do tend to have runs of values, and values that fit in a fixed number of bits. Exponents and sign bits would probably fit this model extremely well most of the time if you have similar floating point values or sorted values. It would be really interesting to see how well this performs in comparison to the compression tests you've already done. For mantissa bits, I agree it wouldn't be worth encoding first. On Thu, Sep 12, 2019 at 2:56 AM Radev, Martin <martin.ra...@tum.de<mailto:martin.ra...@tum.de>> wrote: Hello Ryan, Wes and other parquet devs, thanks for the response. I was away on vacation and that's why I am answering just now. > whether you think adding run-length encoding to any of the byte streams would > be beneficial before applying Zstd. The short answer is "only for some cases but it will make it worse in both compression ratio and speed for other". Our initial investigation also separated the sign, exponent and mantissa into separate streams. The encoding was the following assuming 32-bit IEEE754: - stream of sign bits - stream of exponents bits. Conveniently the exponent for a 32-bit IEEE754 number is 8 bits. - separate the remaining 23 bits into four streams of 8, 8, 7 bits. An extra zero bit is added to the block which has only seven bits. This was done since zstd, zlib, etc work at a byte granularity and we would want repetitions to happen at such. For 64-bit IEEE754 even more padding has to be added since the exponent is 11 bits and the mantissa is 52 bits. Thus, we have to add 5 more exponent bits and 4 more mantissa bits to keep repetitions at a byte granularity. My original report shows results for when the floating-point values are split at a component granularity. Report is here: https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view Results are just slightly better in terms of compression ratio for some tests but compression and decompression speed is expectedly worse. The reason is that splitting a value is somewhat more complex. We need to keep a stream of bits for the signs, keep track of when a byte in the stream is exhausted, do bit manipulation to extract components, etc. This is also the reason why I preferred to go with the byte-wise decomposition of the values. It's faster and the compression ratio is just slightly worse for some of the tests. Using RLE for the sign, exponents and the top-most mantissa bytes can help when data is repetitive and make it worse for other. I suppose using one of the compressors yields a better compression ratio on average. Also, this can again make encoding and decoding slower. The design of the BYTE_STREAM_SPLIT encoding had in mind two things: - It would only make data more compressible and leave compression to the codec in use. This leaves the complexity to the codec and choice of speed/compression ratio to the user. - It should be fast. There's an extra compression step so preferably there's very little latency before it. @Wes, can you have a look? More opinions are welcome. If you have floating point data available, I would be very happy to examine whether this approach offers benefit for you. Regards, Martin ________________________________ From: Ryan Blue <rb...@netflix.com.INVALID> Sent: Tuesday, September 3, 2019 11:51:46 PM To: Parquet Dev Cc: Raoofy, Amir; Karlstetter, Roman Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet Hi Martin, Thanks for taking a look at this! I agree that the approach here looks promising. We've had occasional requests for lossy floating point compression in the past, so it would be good to add this. I did some work in this area a few years ago that is similar and I'd like to hear what you think about that approach compared to this one. That work was based on the same observation, that the main problem is the mantissa, while exponents tend to compress well. What I did was take the exponent and mantissa and encode each separately, like the component encoding in your test. But to encode each stream, I used Parquet's RLE encoder instead of just applying compression. This seemed to work well for exponents and sign bits, but probably isn't worth the cost for mantissa bits. It could also be interesting to test a separate stream for sign bits. I guess what I'd like to hear your take on is whether you think adding run-length encoding to any of the byte streams would be beneficial before applying Zstd. Thanks! rb On Tue, Sep 3, 2019 at 12:30 PM Wes McKinney <wesmck...@gmail.com<mailto:wesmck...@gmail.com>> wrote: > I'm interested in this. I have been busy the last couple of weeks so have > not been able to take a closer look. I will try to give some feedback this > week. > > Thanks > > On Tue, Sep 3, 2019, 2:17 PM Radev, Martin > <martin.ra...@tum.de<mailto:martin.ra...@tum.de>> wrote: > > > Hello all, > > > > > > thank you Julien for the interest. > > > > > > Could other people, part of Apache Parquet, share their opinions? > > > > Do you have your own data which you would like to use for testing the new > > encoding? > > > > > > Regards, > > > > Martin > > > > ________________________________ > > From: Julien Le Dem <julien.le...@wework.com.INVALID> > > Sent: Friday, August 30, 2019 2:38:37 AM > > To: dev@parquet.apache.org<mailto:dev@parquet.apache.org> > > Cc: Raoofy, Amir; Karlstetter, Roman > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet > > > > I think this looks promising to me. At first glance it seems combining > > simplicity and efficiency. > > I'd like to hear more from other members of the PMC. > > > > On Tue, Aug 27, 2019 at 5:30 AM Radev, Martin > > <martin.ra...@tum.de<mailto:martin.ra...@tum.de>> > wrote: > > > > > Dear all, > > > > > > > > > there was some earlier discussion on adding a new encoding for better > > > compression of FP32 and FP64 data. > > > > > > > > > The pull request which extends the format is here: > > > https://github.com/apache/parquet-format/pull/144 > > > The change has one approval from earlier from Zoltan. > > > > > > > > > The results from an investigation on compression ratio and speed with > the > > > new encoding vs other encodings is available here: > > > https://github.com/martinradev/arrow-fp-compression-bench > > > It is visible that for many tests the new encoding performs better in > > > compression ratio and in some cases in speed. The improvements in > > > compression speed come from the fact that the new format can > potentially > > > lead to a faster parsing for some compressors like GZIP. > > > > > > > > > An earlier report which examines other FP compressors (fpzip, spdp, > fpc, > > > zfp, sz) and new potential encodings is available here: > > > > > > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing > > > The report also covers lossy compression but the BYTE_STREAM_SPLIT > > > encoding only has the focus of lossless compression. > > > > > > > > > Can we have a vote? > > > > > > > > > Regards, > > > > > > Martin > > > > > > > > > -- Ryan Blue Software Engineer Netflix -- Ryan Blue Software Engineer Netflix