> Using RLE for the sign, exponents and the top-most mantissa bytes can help
> when data is repetitive and make it worse for others.
I agree. But we use RLE in similar cases because we do tend to have runs of values, and values that fit in a fixed number of bits. Exponents and sign bits would probably fit this model extremely well most of the time if you have similar or sorted floating-point values. It would be really interesting to see how well this performs in comparison to the compression tests you've already done. For mantissa bits, I agree it wouldn't be worth encoding first.

On Thu, Sep 12, 2019 at 2:56 AM Radev, Martin <[email protected]> wrote:

> Hello Ryan, Wes and other parquet devs,
>
> thanks for the response. I was away on vacation, which is why I am answering just now.
>
> > whether you think adding run-length encoding to any of the byte streams
> > would be beneficial before applying Zstd.
>
> The short answer is "only for some cases, and it will make both compression ratio and speed worse for others".
>
> Our initial investigation also separated the sign, exponent and mantissa into separate streams. The encoding was the following, assuming 32-bit IEEE 754:
>
> - a stream of sign bits
> - a stream of exponent bits. Conveniently, the exponent of a 32-bit IEEE 754 number is 8 bits.
> - the remaining 23 mantissa bits, separated into three streams of 8, 8 and 7 bits. An extra zero bit is added to the block which has only seven bits. This was done since zstd, zlib, etc. work at byte granularity, and we want repetitions to occur at that granularity.
>
> For 64-bit IEEE 754 even more padding has to be added, since the exponent is 11 bits and the mantissa is 52 bits. Thus, we have to add 5 more exponent bits and 4 more mantissa bits to keep repetitions at byte granularity. My original report shows results for when the floating-point values are split at component granularity.
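[Editor's note: the component-wise decomposition described above can be sketched as follows. This is an illustrative Python sketch for the 32-bit case, not the code from the report; the function name and return layout are assumptions made for the example. The 7 low mantissa bits are padded with one zero bit so every stream stays byte-aligned, as the email describes.]

```python
import struct

def split_components_fp32(values):
    """Split 32-bit floats into a sign-bit stream, an exponent byte
    stream, and three mantissa streams of 8, 8 and 7(+1 pad) bits."""
    signs = []                       # one bit per value
    exponents = []                   # one byte per value (bits 30..23)
    m_hi, m_mid, m_lo = [], [], []   # 8 + 8 + 7 mantissa bits
    for v in values:
        bits = struct.unpack('<I', struct.pack('<f', v))[0]
        signs.append(bits >> 31)
        exponents.append((bits >> 23) & 0xFF)
        mant = bits & 0x7FFFFF           # the 23 mantissa bits
        m_hi.append((mant >> 15) & 0xFF)
        m_mid.append((mant >> 7) & 0xFF)
        m_lo.append((mant & 0x7F) << 1)  # 7 bits + one zero pad bit
    return signs, exponents, m_hi, m_mid, m_lo
```

Each returned stream can then be handed to the compression codec separately; for similar or sorted values the sign and exponent streams are highly repetitive.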
> The report is here: https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
>
> Results are just slightly better in terms of compression ratio for some tests, but compression and decompression speed is, expectedly, worse. The reason is that splitting a value is somewhat more complex: we need to keep a stream of bits for the signs, keep track of when a byte in a stream is exhausted, do bit manipulation to extract components, and so on. This is also the reason why I preferred to go with the byte-wise decomposition of the values. It is faster, and the compression ratio is only slightly worse for some of the tests.
>
> Using RLE for the sign, exponent and top-most mantissa bytes can help when data is repetitive and make it worse for others. I suppose using one of the compressors yields a better compression ratio on average. Also, this can again make encoding and decoding slower.
>
> The design of the BYTE_STREAM_SPLIT encoding had two things in mind:
>
> - It should only make data more compressible and leave compression to the codec in use. This leaves the complexity to the codec and the choice of speed vs. compression ratio to the user.
> - It should be fast. There is an extra compression step afterwards, so preferably the encoding adds very little latency before it.
>
> @Wes, can you have a look?
>
> More opinions are welcome. If you have floating-point data available, I would be very happy to examine whether this approach offers a benefit for you.
>
> Regards,
>
> Martin
>
> ------------------------------
> *From:* Ryan Blue <[email protected]>
> *Sent:* Tuesday, September 3, 2019 11:51:46 PM
> *To:* Parquet Dev
> *Cc:* Raoofy, Amir; Karlstetter, Roman
> *Subject:* Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
>
> Hi Martin,
>
> Thanks for taking a look at this! I agree that the approach here looks promising. We've had occasional requests for lossy floating point compression in the past, so it would be good to add this.
> I did some work in this area a few years ago that is similar, and I'd like to hear what you think about that approach compared to this one. That work was based on the same observation: the main problem is the mantissa, while exponents tend to compress well. What I did was take the exponent and mantissa and encode each separately, like the component encoding in your test. But to encode each stream, I used Parquet's RLE encoder instead of just applying compression. This seemed to work well for exponents and sign bits, but probably isn't worth the cost for mantissa bits. It could also be interesting to test a separate stream for sign bits.
>
> I guess what I'd like to hear your take on is whether you think adding run-length encoding to any of the byte streams would be beneficial before applying Zstd.
>
> Thanks!
>
> rb
>
> On Tue, Sep 3, 2019 at 12:30 PM Wes McKinney <[email protected]> wrote:
>
> > I'm interested in this. I have been busy the last couple of weeks, so I have not been able to take a closer look. I will try to give some feedback this week.
> >
> > Thanks
> >
> > On Tue, Sep 3, 2019, 2:17 PM Radev, Martin <[email protected]> wrote:
> >
> > > Hello all,
> > >
> > > thank you, Julien, for the interest.
> > >
> > > Could other people, part of Apache Parquet, share their opinions? Do you have your own data which you would like to use for testing the new encoding?
> > >
> > > Regards,
> > >
> > > Martin
> > >
> > > ________________________________
> > > From: Julien Le Dem <[email protected]>
> > > Sent: Friday, August 30, 2019 2:38:37 AM
> > > To: [email protected]
> > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
> > >
> > > This looks promising to me. At first glance it seems to combine simplicity and efficiency. I'd like to hear more from other members of the PMC.
> > > On Tue, Aug 27, 2019 at 5:30 AM Radev, Martin <[email protected]> wrote:
> > >
> > > > Dear all,
> > > >
> > > > there was some earlier discussion on adding a new encoding for better compression of FP32 and FP64 data.
> > > >
> > > > The pull request which extends the format is here:
> > > > https://github.com/apache/parquet-format/pull/144
> > > > The change already has one approval, from Zoltan.
> > > >
> > > > The results of an investigation of compression ratio and speed with the new encoding versus other encodings are available here:
> > > > https://github.com/martinradev/arrow-fp-compression-bench
> > > > For many tests the new encoding performs better in compression ratio and, in some cases, in speed. The improvements in compression speed come from the fact that the new format can potentially lead to faster parsing for some compressors, such as GZIP.
> > > >
> > > > An earlier report which examines other FP compressors (fpzip, spdp, fpc, zfp, sz) and new potential encodings is available here:
> > > > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
> > > > The report also covers lossy compression, but the BYTE_STREAM_SPLIT encoding focuses only on lossless compression.
> > > >
> > > > Can we have a vote?
> > > >
> > > > Regards,
> > > >
> > > > Martin
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
Ryan Blue
Software Engineer
Netflix
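[Editor's note: the BYTE_STREAM_SPLIT encoding discussed throughout this thread can be sketched as follows. The idea is simply to scatter byte k of every value into stream k, so that similar bytes (e.g. exponents) end up adjacent and more compressible; the function names here are illustrative, not from the specification or any implementation.]

```python
import struct

def byte_stream_split(values, fmt='<f'):
    """BYTE_STREAM_SPLIT sketch: for K-byte values, produce K streams
    where stream k holds byte k of every value, in order."""
    width = struct.calcsize(fmt)          # 4 for FP32, 8 for FP64
    raw = b''.join(struct.pack(fmt, v) for v in values)
    return [raw[k::width] for k in range(width)]

def byte_stream_unsplit(streams, fmt='<f'):
    """Inverse transform: re-interleave the streams into values."""
    width = struct.calcsize(fmt)
    n = len(streams[0])                   # number of values
    raw = bytes(streams[i % width][i // width] for i in range(n * width))
    return [struct.unpack_from(fmt, raw, off)[0]
            for off in range(0, n * width, width)]
```

Note that the transform itself is lossless and does no compression at all; it only rearranges bytes so that the codec applied afterwards (zstd, gzip, ...) sees longer matches, which is exactly the division of labor the design goals above describe.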
