Hi Andrei,
I tried to find more details on block compression in Parquet (or
compression per data page) and I couldn't find anything to satisfy my
curiosity about how it can be used and how it performs.
I hate being the person who just says "test it first," so I also want to
recommend figuring out how you'd imagine the interface being designed. Some
formats like ORC seem to have two compression modes (optimize for speed or
for space), while Parquet exposes more of the tuning knobs (according to [1]).
And to Gang's point, there's a question of what can be exposed to the
various abstraction levels (perhaps end users would never be interested in
this so it's exposed only through an advanced or internal interface).
Anyway, good luck scoping it out, and feel free to iterate with the
mailing list as you try things out rather than only when you're finished;
maybe someone can chime in with more information and thoughts in the meantime.
[1]: https://arxiv.org/pdf/2304.05028.pdf
Sent from Proton Mail <https://proton.me/mail/home> for iOS
On Sat, Mar 23, 2024 at 05:23, Andrei Lazăr <lazarandrei...@gmail.com> wrote:
Hi Aldrin, thanks for taking the time to reply to my email!
In my understanding, compression in Parquet files is applied at the data page
level for every column, meaning that even within a single row group there can be
multiple units of compressed data, and there are certainly going to be many such
units across an entire Parquet file. Therefore, what I am hoping is that more
granular choices of compression algorithm could lead to better overall
compression, since the data in the same column can differ quite a lot across
row groups.
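To make that concrete, here is a rough sketch (based on my reading of the
parquet-cpp metadata API; the file name is just a placeholder) that prints the
codec recorded for each column chunk, i.e. one entry per (row group, column) pair:

    #include <iostream>
    #include <memory>

    #include <parquet/file_reader.h>
    #include <parquet/metadata.h>

    // Sketch: list the compression codec recorded for every column chunk.
    // "data.parquet" is a placeholder path.
    int main() {
      std::unique_ptr<parquet::ParquetFileReader> reader =
          parquet::ParquetFileReader::OpenFile("data.parquet");
      std::shared_ptr<parquet::FileMetaData> metadata = reader->metadata();

      for (int rg = 0; rg < metadata->num_row_groups(); ++rg) {
        std::unique_ptr<parquet::RowGroupMetaData> rg_meta = metadata->RowGroup(rg);
        for (int col = 0; col < rg_meta->num_columns(); ++col) {
          std::unique_ptr<parquet::ColumnChunkMetaData> chunk = rg_meta->ColumnChunk(col);
          std::cout << "row group " << rg << ", column " << col
                    << ": codec id " << static_cast<int>(chunk->compression()) << "\n";
        }
      }
      return 0;
    }

In other words, the codec is recorded per column chunk, so the format itself
does not seem to prevent different row groups from using different codecs.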
At this very moment, specifying different compression algorithms per column is
already supported, and in my use case it is extremely helpful: I have some
columns (mostly containing floats) for which a compression algorithm like
Snappy (or even no compression at all) speeds up my queries significantly
compared to keeping the data compressed with something like ZSTD or GZIP.
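For reference, this is roughly how I am setting the per-column codecs today
with the C++ WriterProperties builder (the column names below are placeholders
for my schema):

    #include <memory>

    #include <parquet/properties.h>

    // Sketch: one codec per column, applied file-wide -- the granularity the
    // C++ writer exposes today. Column names are placeholders.
    std::shared_ptr<parquet::WriterProperties> MakeWriterProperties() {
      parquet::WriterProperties::Builder builder;
      builder.compression(parquet::Compression::ZSTD);                  // file-wide default
      builder.compression("float_col", parquet::Compression::SNAPPY);   // cheap to decompress
      builder.compression("id_col", parquet::Compression::UNCOMPRESSED);
      return builder.build();
    }

The resulting properties are then passed to parquet::arrow::WriteTable (or a
FileWriter), but they apply to the whole file, which is exactly the limitation
I am running into.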
That being said, your suggestion of writing a benchmark and sharing the
results here to support considering this approach is a great idea; I will
try doing that!
Once again, thank you for your time!
Kind regards,
Andrei
On Fri, 22 Mar 2024 at 22:12, Aldrin <octalene....@pm.me.invalid> wrote:
Hello!
I don't do much with compression, so I could be wrong, but I assume a
compression algorithm spans the whole column, that areas of large variance
generally benefit less from compression, and that the encoding still
provides benefits across separate areas (e.g. separate row groups).
My impression is that compression will not be any better if it's restricted
to only a subset of the data, and if it is scoped to only a subset then
there are extra overheads beyond what you would normally have (the same raw
value would have its encoded value stored once per row group). I suppose
things like run-length encoding won't be any less efficient, but they also
wouldn't be any more efficient (with the caveat of a raw value repeating
across row groups).
A different compression for different columns isn't unreasonable, so I
think I could be easily convinced that it has benefits (though it would
require per-column logic that could slow other things down).
These are just my thoughts, though. Can you share the design and results
of your benchmark? Have you prototyped (or could you prototype) anything to
test it out?
Sent from Proton Mail <https://proton.me/mail/home> for iOS
On Fri, Mar 22, 2024 at 14:36, Andrei Lazăr <lazarandrei...@gmail.com> wrote:
Hi Gang,
Thanks a lot for getting back to me!
So the use case I have is relatively simple: I was playing around with some
data and wanted to benchmark different compression algorithms in an effort to
speed up data retrieval in a simple Parquet-based database I am experimenting
with. Whilst doing so, I noticed a very large variance in the performance of
the same compression algorithm over different row groups in my Parquet files.
Therefore, I was thinking that the best compression configuration for my data
would be to use a different algorithm for every column, for every row group in
my files. In a real-world situation, I can see this being used by a database,
either when new entries are inserted into it, or even as a background
'optimizer' job that runs over existing data.
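If it helps, this is the rough shape of the benchmark I have in mind (a sketch
using the Arrow C++ API; the table is assumed to be built elsewhere, and the
paths, codecs and chunk size are placeholders):

    #include <chrono>
    #include <iostream>
    #include <memory>
    #include <string>
    #include <utility>
    #include <vector>

    #include <arrow/api.h>
    #include <arrow/io/file.h>
    #include <parquet/arrow/reader.h>
    #include <parquet/arrow/writer.h>
    #include <parquet/properties.h>

    // Sketch: write the same table once per codec, then compare file size and
    // the time taken to read the file back.
    arrow::Status RunBenchmark(const std::shared_ptr<arrow::Table>& table) {
      const std::vector<std::pair<std::string, parquet::Compression::type>> codecs = {
          {"uncompressed", parquet::Compression::UNCOMPRESSED},
          {"snappy", parquet::Compression::SNAPPY},
          {"zstd", parquet::Compression::ZSTD},
          {"gzip", parquet::Compression::GZIP},
      };

      for (const auto& [name, codec] : codecs) {
        const std::string path = "bench_" + name + ".parquet";

        // Write the table with this codec.
        parquet::WriterProperties::Builder builder;
        builder.compression(codec);
        ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
        ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
            *table, arrow::default_memory_pool(), sink, /*chunk_size=*/64 * 1024,
            builder.build()));
        ARROW_RETURN_NOT_OK(sink->Close());

        // Time a full read back.
        const auto start = std::chrono::steady_clock::now();
        ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(path));
        std::unique_ptr<parquet::arrow::FileReader> reader;
        ARROW_RETURN_NOT_OK(
            parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));
        std::shared_ptr<arrow::Table> read_back;
        ARROW_RETURN_NOT_OK(reader->ReadTable(&read_back));
        const auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - start);

        ARROW_ASSIGN_OR_RAISE(const int64_t size, infile->GetSize());
        std::cout << name << ": " << size << " bytes, read in "
                  << elapsed.count() << " ms" << std::endl;
      }
      return arrow::Status::OK();
    }

To measure the per-row-group variance I mentioned, I believe the same loop can
time FileReader::ReadRowGroup for each row group instead of reading the whole
file.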
How do you feel about this?
Thank you,
Andrei
On Thu, 21 Mar 2024 at 02:11, Gang Wu <ust...@gmail.com> wrote:
Hi Andrei,
What is your use case? IMHO, exposing this kind of configuration
will force users to know how the writer splits row groups, which
does not look simple to me.
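To make the point concrete, the splitting today is mostly driven by the chunk
size passed to the writer and by max_row_group_length on WriterProperties, so a
per-row-group codec would have to be expressed against boundaries the writer
decides. A rough sketch of the knobs involved (names from memory):

    #include <memory>

    #include <parquet/properties.h>

    // Sketch: the knobs that currently influence where row groups get split.
    // A per-row-group codec setting would have to be defined relative to
    // boundaries produced by these, which the user does not directly see.
    std::shared_ptr<parquet::WriterProperties> MakeProps() {
      parquet::WriterProperties::Builder builder;
      builder.max_row_group_length(1 << 20);            // cap on rows per row group
      builder.compression(parquet::Compression::ZSTD);  // today: per column, file-wide
      return builder.build();
    }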
Best,
Gang
On Thu, Mar 21, 2024 at 2:25 AM Andrei Lazăr <lazarandrei...@gmail.com
wrote:
Hi all,
I would like to propose adding support for writing a Parquet file with
different compression algorithms for every row group.
In my understanding, the Parquet format allows this; however, it seems to me
that there is no way to achieve this from the C++ implementation.
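To illustrate, something along these lines is what I have in mind; to be clear,
the per-row-group overload below is purely hypothetical and does not exist today:

    #include <parquet/properties.h>

    // Purely hypothetical illustration of the proposal -- the commented-out
    // overload does NOT exist in the current C++ API.
    void Illustration() {
      parquet::WriterProperties::Builder builder;
      builder.compression(parquet::Compression::ZSTD);  // exists today: file-wide default
      // Hypothetical: override the codec for one column within one row group.
      // builder.compression(/*row_group=*/3, "float_col", parquet::Compression::SNAPPY);
    }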
Does anyone have any thoughts on this?
Thank you,
Andrei