Hi Andrei,
I tried to find more details on block compression in Parquet (or compression 
per data page), but I couldn't find anything that satisfied my curiosity about 
how it can be used and how it performs.
I hate being the person who just says "test it first," so I'd also recommend 
figuring out how you imagine the interface being designed. Some formats like 
ORC seem to have two compression modes (optimize for speed or for space), 
while Parquet exposes more of the tuning knobs (according to [1]). And to 
Gang's point, there's the question of what should be exposed at the various 
abstraction levels (perhaps end users would never be interested in this, so it 
would only be exposed through an advanced or internal interface).
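For what it's worth, here is a rough sketch of the knobs the C++ writer already 
exposes today through WriterProperties; the column name "floats" is just a 
placeholder and the values are illustrative, not recommendations:

#include <memory>
#include <parquet/properties.h>

// Sketch of existing parquet-cpp tuning knobs; "floats" is a placeholder
// column name and the numbers are arbitrary examples.
std::shared_ptr<parquet::WriterProperties> MakeProps() {
  return parquet::WriterProperties::Builder()
      .compression(parquet::Compression::ZSTD)   // file-wide default codec
      ->compression_level(3)                     // codec-specific level
      ->disable_dictionary("floats")             // dictionary on/off per column
      ->data_pagesize(1024 * 1024)               // target data page size in bytes
      ->max_row_group_length(1024 * 1024)        // cap on rows per row group
      ->build();
}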
Anyway, good luck scoping it out, and feel free to iterate with the mailing 
list as you try things out rather than only when you're finished; maybe someone 
can chime in with more information and thoughts in the meantime.
[1]: https://arxiv.org/pdf/2304.05028.pdf
On Sat, Mar 23, 2024 at 05:23, Andrei Lazăr <lazarandrei...@gmail.com> wrote:

Hi Aldrin,

Thanks for taking the time to reply to my email!

In my understanding, compression in Parquet files happens at the data page
level for every column, meaning that even within a single row group there can
be multiple units of compressed data, and there will certainly be different
compression units across an entire Parquet file. What I am therefore hoping is
that more granular compression algorithm choices could lead to better overall
compression, since the data in the same column can differ quite a lot across
row groups.
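For reference, the codec is recorded per column chunk (i.e. per column, per
row group) in the file metadata, which is what makes such granular choices
representable in the format. A minimal sketch that prints what an existing
file uses (the path is a placeholder):

#include <iostream>
#include <parquet/file_reader.h>
#include <parquet/metadata.h>

// Print the codec recorded for every column chunk of every row group.
// "data.parquet" is a placeholder path.
void PrintCodecs() {
  auto reader = parquet::ParquetFileReader::OpenFile("data.parquet");
  auto metadata = reader->metadata();
  for (int rg = 0; rg < metadata->num_row_groups(); ++rg) {
    auto rg_meta = metadata->RowGroup(rg);
    for (int col = 0; col < rg_meta->num_columns(); ++col) {
      std::cout << "row group " << rg << ", column " << col << ": codec "
                << static_cast<int>(rg_meta->ColumnChunk(col)->compression())
                << std::endl;
    }
  }
}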

At the moment, specifying a different compression algorithm per column is
already supported, and in my use case it is extremely helpful: I have some
columns (mostly containing floats) for which an algorithm like Snappy (or even
no compression at all) speeds up my queries significantly compared to keeping
the data compressed with something like ZSTD or GZIP.
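In case it is useful to anyone following along, a minimal sketch of how that
per-column choice looks with the Arrow writer (the column name "measurements"
and the output path are placeholders):

#include <arrow/io/file.h>
#include <arrow/memory_pool.h>
#include <arrow/result.h>
#include <arrow/table.h>
#include <parquet/arrow/writer.h>
#include <parquet/properties.h>

// Sketch: ZSTD as the default codec, Snappy for one float-heavy column.
// "measurements" and "out.parquet" are placeholder names.
arrow::Status WritePerColumn(const std::shared_ptr<arrow::Table>& table) {
  auto props = parquet::WriterProperties::Builder()
                   .compression(parquet::Compression::ZSTD)
                   ->compression("measurements", parquet::Compression::SNAPPY)
                   ->build();
  ARROW_ASSIGN_OR_RAISE(auto sink,
                        arrow::io::FileOutputStream::Open("out.parquet"));
  return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), sink,
                                    /*chunk_size=*/1024 * 1024, props);
}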

That being said, your suggestion of writing a benchmark and sharing the
results here to support considering this approach is a great idea; I will try
doing that!
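As a starting point, I am thinking of something along these lines: write the
same table once per codec and time a full read of each file. A very rough
sketch (file paths are placeholders, and real numbers would of course need
repeated runs on realistic data):

#include <chrono>
#include <iostream>
#include <memory>
#include <string>
#include <arrow/io/file.h>
#include <arrow/memory_pool.h>
#include <arrow/result.h>
#include <arrow/status.h>
#include <arrow/table.h>
#include <parquet/arrow/reader.h>

// Time a full-table read of an already-written file; call once per codec
// variant (e.g. "zstd.parquet", "snappy.parquet") and compare the timings.
arrow::Status TimeRead(const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open(path));
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(
      parquet::arrow::OpenFile(input, arrow::default_memory_pool(), &reader));
  std::shared_ptr<arrow::Table> table;
  auto start = std::chrono::steady_clock::now();
  ARROW_RETURN_NOT_OK(reader->ReadTable(&table));
  auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
      std::chrono::steady_clock::now() - start);
  std::cout << path << ": " << table->num_rows() << " rows in "
            << elapsed.count() << " ms" << std::endl;
  return arrow::Status::OK();
}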

Once again, thank you for your time!

Kind regards,
Andrei

On Fri, 22 Mar 2024 at 22:12, Aldrin <octalene....@pm.me.invalid> wrote:

> Hello!
>
> I don't do much with compression, so I could be wrong, but I assume a
> compression algorithm spans the whole column: areas of large variance
> generally benefit less from the compression, while the encoding still
> provides benefits across separate areas (e.g. separate row groups).
>
> My impression is that compression will not be any better if it's restricted
> to only a subset of the data, and if it is scoped to only a subset of the
> data there are extra overheads beyond what you would normally have (the same
> raw value would end up with the same encoded value stored once per row
> group). I suppose things like run-length encoding won't be any less
> efficient, but they also wouldn't be any more efficient (with the caveat of
> a raw value repeating across row groups).
>
> A different compression for different columns isn't unreasonable, so I
> think I could be easily convinced that it has benefits (though it would
> require per-column logic that could slow other things down).
>
> These are just my thoughts, though. Can you share the design and results
> of your benchmark? Have you prototyped (or could you prototype) anything to
> test it out?
>
>
> On Fri, Mar 22, 2024 at 14:36, Andrei Lazăr <lazarandrei...@gmail.com>
> wrote:
>
> Hi Gang,
>
> Thanks a lot for getting back to me!
>
> So the use case I have is relatively simple: I was playing around with some
> data and wanted to benchmark different compression algorithms in an effort
> to speed up data retrieval in a simple Parquet-based database that I am
> experimenting with. While doing so, I noticed a very large variance in the
> performance of the same compression algorithm across different row groups in
> my Parquet files. I was therefore thinking that the best compression
> configuration for my data would be to use a different algorithm for every
> column, for every row group, in my files. In a real-world situation, I can
> see this being used by a database, either when new entries are inserted into
> it, or as a background 'optimizer' job that runs over existing data.
>
> How do you feel about this?
>
> Thank you,
> Andrei
>
> On Thu, 21 Mar 2024 at 02:11, Gang Wu <ust...@gmail.com> wrote:
>
> > Hi Andrei,
> >
> > What is your use case? IMHO, exposing this kind of configuration
> > will force users to know how the writer will split row groups, which
> > does not look simple to me.
> >
> > Best,
> > Gang
> >
> > On Thu, Mar 21, 2024 at 2:25 AM Andrei Lazăr <lazarandrei...@gmail.com>
> > wrote:
> >
> > > Hi all,
> > >
> > > I would like to propose adding support for writing a Parquet file with
> > > different compression algorithms for every row group.
> > >
> > > In my understanding, the Parquet format allows this; however, it seems
> > > to me that there is no way to achieve this from the C++ implementation.
> > >
> > > Does anyone have any thoughts on this?
> > >
> > > Thank you,
> > > Andrei
> > >
> >
>
>
