Sounds good, thanks for working on this! On Thu, Sep 19, 2019 at 6:10 AM Radev, Martin <[email protected]> wrote:
> Hello Ryan, > > > we decided that it would be beneficial to try out your proposal. > > > I will look into it and provide measurements on the compression ratio and > speed. > > > Regards, > > Martin > ------------------------------ > *From:* Ryan Blue <[email protected]> > *Sent:* Saturday, September 14, 2019 2:23:20 AM > *To:* Radev, Martin > *Cc:* Parquet Dev; Raoofy, Amir; Karlstetter, Roman > *Subject:* Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet > > > Using RLE for the sign, exponents and the top-most mantissa bytes can > help when data is repetitive and make it worse for other. > > I agree. But we use RLE in similar cases because we do tend to have runs > of values, and values that fit in a fixed number of bits. Exponents and > sign bits would probably fit this model extremely well most of the time if > you have similar floating point values or sorted values. It would be really > interesting to see how well this performs in comparison to the compression > tests you've already done. For mantissa bits, I agree it wouldn't be worth > encoding first. > > On Thu, Sep 12, 2019 at 2:56 AM Radev, Martin <[email protected]> wrote: > >> Hello Ryan, Wes and other parquet devs, >> >> >> thanks for the response. I was away on vacation and that's why I am >> answering just now. >> >> >> > whether you think adding run-length encoding to any of the byte >> streams would be beneficial before applying Zstd. >> The short answer is "only for some cases but it will make it worse in >> both compression ratio and speed for other". >> >> Our initial investigation also separated the sign, exponent and mantissa >> into separate streams. >> >> The encoding was the following assuming 32-bit IEEE754: >> >> - stream of sign bits >> >> - stream of exponents bits. Conveniently the exponent for a 32-bit >> IEEE754 number is 8 bits. >> >> - separate the remaining 23 bits into four streams of 8, 8, 7 bits. An >> extra zero bit is added to the block which has only seven bits. This was >> done since zstd, zlib, etc work at a byte granularity and we would >> want repetitions to happen at such. >> >> For 64-bit IEEE754 even more padding has to be added since the exponent >> is 11 bits and the mantissa is 52 bits. Thus, we have to add 5 more >> exponent bits and 4 more mantissa bits to keep repetitions at a byte >> granularity. My original report shows results for when the floating-point >> values are split at a component granularity. Report is here: >> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view >> Results are just slightly better in terms of compression ratio for some >> tests but compression and decompression speed is expectedly worse. The >> reason is that splitting a value is somewhat more complex. We need to keep >> a stream of bits for the signs, keep track of when a byte in the stream is >> exhausted, do bit manipulation to extract components, etc. This is also the >> reason why I preferred to go with the byte-wise decomposition of the >> values. It's faster and the compression ratio is just slightly worse for >> some of the tests. >> >> >> Using RLE for the sign, exponents and the top-most mantissa bytes can >> help when data is repetitive and make it worse for other. I suppose using >> one of the compressors yields a better compression ratio on average. Also, >> this can again make encoding and decoding slower. >> >> >> The design of the BYTE_STREAM_SPLIT encoding had in mind two things: >> >> - It would only make data more compressible and leave compression to the >> codec in use. >> This leaves the complexity to the codec and choice of speed/compression >> ratio to the user. >> >> - It should be fast. >> There's an extra compression step so preferably there's very little >> latency before it. >> >> @Wes, can you have a look? >> >> More opinions are welcome. >> >> If you have floating point data available, I would be very happy to >> examine whether this approach offers benefit for you. >> >> >> Regards, >> >> Martin >> >> ------------------------------ >> *From:* Ryan Blue <[email protected]> >> *Sent:* Tuesday, September 3, 2019 11:51:46 PM >> *To:* Parquet Dev >> *Cc:* Raoofy, Amir; Karlstetter, Roman >> *Subject:* Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet >> >> Hi Martin, >> >> Thanks for taking a look at this! I agree that the approach here looks >> promising. We've had occasional requests for lossy floating point >> compression in the past, so it would be good to add this. >> >> I did some work in this area a few years ago that is similar and I'd like >> to hear what you think about that approach compared to this one. That work >> was based on the same observation, that the main problem is the mantissa, >> while exponents tend to compress well. What I did was take the exponent >> and >> mantissa and encode each separately, like the component encoding in your >> test. But to encode each stream, I used Parquet's RLE encoder instead of >> just applying compression. This seemed to work well for exponents and sign >> bits, but probably isn't worth the cost for mantissa bits. It could also >> be >> interesting to test a separate stream for sign bits. >> >> I guess what I'd like to hear your take on is whether you think adding >> run-length encoding to any of the byte streams would be beneficial before >> applying Zstd. >> >> Thanks! >> >> rb >> >> On Tue, Sep 3, 2019 at 12:30 PM Wes McKinney <[email protected]> wrote: >> >> > I'm interested in this. I have been busy the last couple of weeks so >> have >> > not been able to take a closer look. I will try to give some feedback >> this >> > week. >> > >> > Thanks >> > >> > On Tue, Sep 3, 2019, 2:17 PM Radev, Martin <[email protected]> wrote: >> > >> > > Hello all, >> > > >> > > >> > > thank you Julien for the interest. >> > > >> > > >> > > Could other people, part of Apache Parquet, share their opinions? >> > > >> > > Do you have your own data which you would like to use for testing the >> new >> > > encoding? >> > > >> > > >> > > Regards, >> > > >> > > Martin >> > > >> > > ________________________________ >> > > From: Julien Le Dem <[email protected]> >> > > Sent: Friday, August 30, 2019 2:38:37 AM >> > > To: [email protected] >> > > Cc: Raoofy, Amir; Karlstetter, Roman >> > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet >> > > >> > > I think this looks promising to me. At first glance it seems combining >> > > simplicity and efficiency. >> > > I'd like to hear more from other members of the PMC. >> > > >> > > On Tue, Aug 27, 2019 at 5:30 AM Radev, Martin <[email protected]> >> > wrote: >> > > >> > > > Dear all, >> > > > >> > > > >> > > > there was some earlier discussion on adding a new encoding for >> better >> > > > compression of FP32 and FP64 data. >> > > > >> > > > >> > > > The pull request which extends the format is here: >> > > > https://github.com/apache/parquet-format/pull/144 >> > > > The change has one approval from earlier from Zoltan. >> > > > >> > > > >> > > > The results from an investigation on compression ratio and speed >> with >> > the >> > > > new encoding vs other encodings is available here: >> > > > https://github.com/martinradev/arrow-fp-compression-bench >> > > > It is visible that for many tests the new encoding performs better >> in >> > > > compression ratio and in some cases in speed. The improvements in >> > > > compression speed come from the fact that the new format can >> > potentially >> > > > lead to a faster parsing for some compressors like GZIP. >> > > > >> > > > >> > > > An earlier report which examines other FP compressors (fpzip, spdp, >> > fpc, >> > > > zfp, sz) and new potential encodings is available here: >> > > > >> > > >> > >> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing >> > > > The report also covers lossy compression but the BYTE_STREAM_SPLIT >> > > > encoding only has the focus of lossless compression. >> > > > >> > > > >> > > > Can we have a vote? >> > > > >> > > > >> > > > Regards, >> > > > >> > > > Martin >> > > > >> > > > >> > > >> > >> >> >> -- >> Ryan Blue >> Software Engineer >> Netflix >> > > > -- > Ryan Blue > Software Engineer > Netflix > -- Ryan Blue Software Engineer Netflix
