Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet

Ryan Blue Thu, 19 Sep 2019 10:55:45 -0700

Sounds good, thanks for working on this!

On Thu, Sep 19, 2019 at 6:10 AM Radev, Martin <[email protected]> wrote:


> Hello Ryan,
>
>
> we decided that it would be beneficial to try out your proposal.
>
>
> I will look into it and provide measurements on the compression ratio and
> speed.
>
>
> Regards,
>
> Martin
> ------------------------------
> *From:* Ryan Blue <[email protected]>
> *Sent:* Saturday, September 14, 2019 2:23:20 AM
> *To:* Radev, Martin
> *Cc:* Parquet Dev; Raoofy, Amir; Karlstetter, Roman
> *Subject:* Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
>
> > Using RLE for the sign, exponents and the top-most mantissa bytes can
> > help when data is repetitive and make it worse for other data.
>
> I agree. But we use RLE in similar cases because we do tend to have runs
> of values, and values that fit in a fixed number of bits. Exponents and
> sign bits would probably fit this model extremely well most of the time if
> you have similar floating point values or sorted values. It would be really
> interesting to see how well this performs in comparison to the compression
> tests you've already done. For mantissa bits, I agree it wouldn't be worth
> encoding first.
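
To illustrate the point about runs of exponents in sorted or similar values, here is a minimal sketch. It uses a plain (value, run length) scheme, not Parquet's actual RLE/bit-packing hybrid encoder, and the function name `exponent_runs` is hypothetical.

```python
import struct
from itertools import groupby

def exponent_runs(values):
    """Return (biased_exponent, run_length) pairs for float32 values.

    Plain run-length pairs for illustration only; this is an assumption,
    not the encoder used in Parquet.
    """
    exps = []
    for v in values:
        # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
        (bits,) = struct.unpack("<I", struct.pack("<f", v))
        # Bits 23..30 hold the 8-bit biased exponent.
        exps.append((bits >> 23) & 0xFF)
    # Collapse consecutive equal exponents into runs.
    return [(e, len(list(g))) for e, g in groupby(exps)]

runs = exponent_runs([1.0, 1.25, 1.5, 2.0, 3.0, 4.0])
```

For sorted input such as the list above, the exponents collapse into a few runs; unsorted data with widely varying magnitudes would produce many short runs, in which case run-length pairs take more space than the raw bytes.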
>
> On Thu, Sep 12, 2019 at 2:56 AM Radev, Martin <[email protected]> wrote:
>
>> Hello Ryan, Wes and other parquet devs,
>>
>>
>> thanks for the response. I was away on vacation and that's why I am
>> answering just now.
>>
>>
>> > whether you think adding run-length encoding to any of the byte
>> > streams would be beneficial before applying Zstd.
>> The short answer is "only for some cases, and it will make both
>> compression ratio and speed worse for others".
>>
>> Our initial investigation also separated the sign, exponent and mantissa
>> into separate streams.
>>
>> The encoding was the following assuming 32-bit IEEE754:
>>
>> - stream of sign bits
>>
>> - stream of exponent bits. Conveniently the exponent for a 32-bit
>> IEEE754 number is 8 bits.
>>
>> - separate the remaining 23 bits into three streams of 8, 8 and 7 bits. An
>> extra zero bit is added to the stream which has only seven bits. This was
>> done since zstd, zlib, etc. work at a byte granularity and we would
>> want repetitions to happen at that granularity.
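
The component split listed above can be sketched as follows. This is a minimal illustration, not the code used for the report. It assumes the 7-bit block is the top of the mantissa (the thread does not say which block is the short one) and keeps sign bits as an unpacked list for clarity. The function name `split_components_f32` is hypothetical.

```python
import struct

def split_components_f32(values):
    """Split float32 values into sign, exponent and three mantissa streams."""
    signs = []               # one bit per value (left unpacked here)
    exponents = bytearray()  # 8-bit biased exponent
    mant_hi = bytearray()    # top 7 mantissa bits, padded with a zero bit (assumption)
    mant_mid = bytearray()   # middle 8 mantissa bits
    mant_lo = bytearray()    # low 8 mantissa bits
    for v in values:
        # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
        (bits,) = struct.unpack("<I", struct.pack("<f", v))
        signs.append(bits >> 31)
        exponents.append((bits >> 23) & 0xFF)
        mant = bits & 0x7FFFFF           # 23-bit mantissa
        mant_hi.append(mant >> 16)       # 7 significant bits, top bit is zero
        mant_mid.append((mant >> 8) & 0xFF)
        mant_lo.append(mant & 0xFF)
    return signs, bytes(exponents), bytes(mant_hi), bytes(mant_mid), bytes(mant_lo)

signs, exps, hi, mid, lo = split_components_f32([1.0, -2.5])
```

A real implementation would also pack the sign bits eight to a byte, which is the bookkeeping that makes this variant slower than a plain byte-wise split.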
>>
>> For 64-bit IEEE754 even more padding has to be added since the exponent
>> is 11 bits and the mantissa is 52 bits. Thus, we have to add 5 padding bits
>> to the exponent and 4 to the mantissa to keep repetitions at a byte
>> granularity. My original report shows results for when the floating-point
>> values are split at a component granularity. Report is here:
>> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
>> Results are just slightly better in terms of compression ratio for some
>> tests but compression and decompression speed are, as expected, worse. The
>> reason is that splitting a value is somewhat more complex. We need to keep
>> a stream of bits for the signs, keep track of when a byte in the stream is
>> exhausted, do bit manipulation to extract components, etc. This is also the
>> reason why I preferred to go with the byte-wise decomposition of the
>> values. It's faster and the compression ratio is just slightly worse for
>> some of the tests.
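
For comparison, the byte-wise decomposition needs no bit manipulation at all. The sketch below scatters byte k of every value into stream k and concatenates the streams. It is an illustration under stated assumptions (little-endian `struct` packing, hypothetical function names), not the implementation from the pull request.

```python
import struct

def byte_stream_split(values, fmt="<f"):
    """Scatter byte k of every value into stream k, then concatenate the streams."""
    width = struct.calcsize(fmt)  # 4 for "<f", 8 for "<d"
    raw = b"".join(struct.pack(fmt, v) for v in values)
    # raw[k::width] picks byte k of each value in order.
    return b"".join(raw[k::width] for k in range(width))

def byte_stream_join(data, fmt="<f"):
    """Inverse of byte_stream_split: gather the bytes back into values."""
    width = struct.calcsize(fmt)
    n = len(data) // width
    # Stream k occupies data[k*n:(k+1)*n]; byte k of value i is data[k*n + i].
    raw = bytes(data[k * n + i] for i in range(n) for k in range(width))
    return [struct.unpack_from(fmt, raw, i * width)[0] for i in range(n)]

encoded = byte_stream_split([1.0, 2.0])
decoded = byte_stream_join(encoded)
```

The output has the same size as the input; any size reduction comes from the codec applied afterwards, which sees the sign and exponent bytes of neighbouring values next to each other.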
>>
>>
>> Using RLE for the sign, exponents and the top-most mantissa bytes can
>> help when data is repetitive and make it worse for other data. I suppose using
>> one of the compressors yields a better compression ratio on average. Also,
>> this can again make encoding and decoding slower.
>>
>>
>> The design of the BYTE_STREAM_SPLIT encoding had two goals in mind:
>>
>> - It would only make data more compressible and leave compression to the
>> codec in use.
>>   This leaves the complexity to the codec and choice of speed/compression
>> ratio to the user.
>>
>> - It should be fast.
>>   There's an extra compression step so preferably there's very little
>> latency before it.
>>
>> @Wes, can you have a look?
>>
>> More opinions are welcome.
>>
>> If you have floating point data available, I would be very happy to
>> examine whether this approach offers benefit for you.
>>
>>
>> Regards,
>>
>> Martin
>>
>> ------------------------------
>> *From:* Ryan Blue <[email protected]>
>> *Sent:* Tuesday, September 3, 2019 11:51:46 PM
>> *To:* Parquet Dev
>> *Cc:* Raoofy, Amir; Karlstetter, Roman
>> *Subject:* Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
>>
>> Hi Martin,
>>
>> Thanks for taking a look at this! I agree that the approach here looks
>> promising. We've had occasional requests for lossy floating point
>> compression in the past, so it would be good to add this.
>>
>> I did some work in this area a few years ago that is similar and I'd like
>> to hear what you think about that approach compared to this one. That work
>> was based on the same observation, that the main problem is the mantissa,
>> while exponents tend to compress well. What I did was take the exponent
>> and mantissa and encode each separately, like the component encoding in
>> your test. But to encode each stream, I used Parquet's RLE encoder instead
>> of just applying compression. This seemed to work well for exponents and
>> sign bits, but probably isn't worth the cost for mantissa bits. It could
>> also be interesting to test a separate stream for sign bits.
>>
>> I guess what I'd like to hear your take on is whether you think adding
>> run-length encoding to any of the byte streams would be beneficial before
>> applying Zstd.
>>
>> Thanks!
>>
>> rb
>>
>> On Tue, Sep 3, 2019 at 12:30 PM Wes McKinney <[email protected]> wrote:
>>
>> > I'm interested in this. I have been busy the last couple of weeks so
>> > have not been able to take a closer look. I will try to give some
>> > feedback this week.
>> >
>> > Thanks
>> >
>> > On Tue, Sep 3, 2019, 2:17 PM Radev, Martin <[email protected]> wrote:
>> >
>> > > Hello all,
>> > >
>> > >
>> > > thank you Julien for the interest.
>> > >
>> > >
>> > > Could other people, part of Apache Parquet, share their opinions?
>> > >
>> > > Do you have your own data which you would like to use for testing the
>> > > new encoding?
>> > >
>> > >
>> > > Regards,
>> > >
>> > > Martin
>> > >
>> > > ________________________________
>> > > From: Julien Le Dem <[email protected]>
>> > > Sent: Friday, August 30, 2019 2:38:37 AM
>> > > To: [email protected]
>> > > Cc: Raoofy, Amir; Karlstetter, Roman
>> > > Subject: Re: [VOTE] Add BYTE_STREAM_SPLIT encoding to Apache Parquet
>> > >
>> > > I think this looks promising to me. At first glance it seems to combine
>> > > simplicity and efficiency.
>> > > I'd like to hear more from other members of the PMC.
>> > >
>> > > On Tue, Aug 27, 2019 at 5:30 AM Radev, Martin <[email protected]> wrote:
>> > >
>> > > > Dear all,
>> > > >
>> > > >
>> > > > there was some earlier discussion on adding a new encoding for
>> > > > better compression of FP32 and FP64 data.
>> > > >
>> > > >
>> > > > The pull request which extends the format is here:
>> > > > https://github.com/apache/parquet-format/pull/144
>> > > > The change has one approval from earlier from Zoltan.
>> > > >
>> > > >
>> > > > The results from an investigation on compression ratio and speed
>> > > > with the new encoding vs other encodings are available here:
>> > > > https://github.com/martinradev/arrow-fp-compression-bench
>> > > > It is visible that for many tests the new encoding performs better
>> > > > in compression ratio and in some cases in speed. The improvements in
>> > > > compression speed come from the fact that the new format can
>> > > > potentially lead to faster parsing for some compressors like GZIP.
>> > > >
>> > > >
>> > > > An earlier report which examines other FP compressors (fpzip, spdp,
>> > > > fpc, zfp, sz) and new potential encodings is available here:
>> > > > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view?usp=sharing
>> > > > The report also covers lossy compression but the BYTE_STREAM_SPLIT
>> > > > encoding focuses only on lossless compression.
>> > > >
>> > > >
>> > > > Can we have a vote?
>> > > >
>> > > >
>> > > > Regards,
>> > > >
>> > > > Martin
>> > > >
>> > > >
>> > >
>> >
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Ryan Blue
Software Engineer
Netflix
