Even if different row groups compress at different ratios, that doesn't explain why you would want a different compression *algorithm* altogether.

The choice of a compression algorithm should basically be driven by two concerns: the acceptable space/time tradeoff (do you want to minimize disk footprint and IO at the cost of more CPU processing time?), and compatibility with other Parquet implementations. Neither of those concerns should be row-group-dependent.

Regards

Antoine.


On 25/03/2024 at 16:30, Gang Wu wrote:
Sometimes rows from different row groups may have different compression ratios when the data distribution varies a lot among them. It seems to me that a harder problem is how you would figure out that pattern before the data is written and compressed. If that is not a problem in your case, it would be much easier just to make each Parquet file contain only one row group and apply a different compression algorithm on a per-file basis.
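
For illustration, a rough sketch of that one-row-group-per-file workaround with the Arrow/Parquet C++ writer; the helper name, file naming scheme, and codec list are made up for the example, not part of any existing API:

#include <memory>
#include <string>
#include <vector>

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/writer.h>
#include <parquet/properties.h>

// Write each chunk of rows to its own single-row-group file, choosing the
// codec per file. "part-<i>.parquet" and the codec vector are illustrative.
arrow::Status WriteChunksWithPerFileCodec(
    const std::vector<std::shared_ptr<arrow::Table>>& chunks,
    const std::vector<parquet::Compression::type>& codecs) {
  for (size_t i = 0; i < chunks.size(); ++i) {
    auto props = parquet::WriterProperties::Builder()
                     .compression(codecs[i])
                     ->build();
    ARROW_ASSIGN_OR_RAISE(
        auto sink, arrow::io::FileOutputStream::Open(
                       "part-" + std::to_string(i) + ".parquet"));
    // A chunk_size >= num_rows keeps each file down to a single row group.
    ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
        *chunks[i], arrow::default_memory_pool(), sink,
        /*chunk_size=*/chunks[i]->num_rows(), props));
  }
  return arrow::Status::OK();
}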

Best,
Gang

On Sun, Mar 24, 2024 at 2:04 AM Aldrin <octalene....@pm.me.invalid> wrote:

Hi Andrei,

I tried finding more details on block compression in parquet (or
compression per data page) and I couldn't find anything to satisfy my
curiosity about how it can be used and how it performs.

I hate being the person who just says "test it first," so I also want to recommend figuring out how you'd imagine the interface being designed. Some
formats like ORC seem to have two compression modes (optimize for speed or
for space), while Parquet exposes more of the tuning knobs (according to [1]).
And to Gang's point, there's a question of what can be exposed at the
various abstraction levels (perhaps end users would never be interested in
this, so it's exposed only through an advanced or internal interface).

Anyways, good luck scoping it out, and feel free to iterate with the
mailing list as you try things out rather than just when you're finished;
maybe someone can chime in with more information and thoughts in the meantime.

[1]: https://arxiv.org/pdf/2304.05028.pdf

Sent from Proton Mail for iOS


On Sat, Mar 23, 2024 at 05:23, Andrei Lazăr <lazarandrei...@gmail.com> wrote:

Hi Aldrin, thanks for taking the time to reply to my email!

In my understanding, compression in Parquet files happens at the data page
level for every column, meaning that even within a row group there can be
multiple units of data compression, and there are most certainly going to
be different units of data compression across an entire Parquet file.
Therefore, what I am hoping is that more granular compression algorithm
choices could lead to better overall compression, as the data in the same
column can differ quite a lot across row groups.

At the moment, specifying different compression algorithms per column is
already supported, and in my use case it is extremely helpful: I have some
columns (mostly containing floats) for which a lighter algorithm like
Snappy (or even no compression at all) speeds up my queries significantly
compared to keeping the data compressed with something like ZSTD or GZIP.
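
A minimal sketch of what that per-column configuration looks like with the parquet-cpp WriterProperties builder; the column names and codec choices here are placeholders, not an actual schema:

#include <memory>
#include <parquet/properties.h>

// Default codec plus per-column overrides; "float_col" and "raw_col" are
// placeholder column names.
std::shared_ptr<parquet::WriterProperties> MakeWriterProps() {
  return parquet::WriterProperties::Builder()
      .compression(parquet::Compression::ZSTD)                     // file-wide default
      ->compression("float_col", parquet::Compression::SNAPPY)     // cheap to decode
      ->compression("raw_col", parquet::Compression::UNCOMPRESSED)
      ->build();
}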

That being said, your suggestion of writing a benchmark and sharing the
results here to support considering this approach is a great idea; I will
try doing that!

Once again, thank you for your time!

Kind regards,
Andrei

On Fri, 22 Mar 2024 at 22:12, Aldrin <octalene....@pm.me.invalid> wrote:

Hello!

I don't do much with compression, so I could be wrong, but I assume a
compression algorithm spans the whole column: areas of large variance
generally benefit less from the compression, while the encoding still
provides benefits across separate areas (e.g. separate row groups).

My impression is that compression will not be any better if it's
restricted to only a subset of the data, and if it is scoped to only a
subset of the data then there are extra overheads beyond what you would
normally have (the same raw value would have the same encoded value stored
per row group). I suppose things like run-length encoding won't be any less
efficient, but they also wouldn't be any more efficient (with the caveat of
a raw value repeating across row groups).

Using a different compression for different columns isn't unreasonable, so I
think I could be easily convinced that it has benefits (though it would
require per-column logic that could slow other things down).

These are just my thoughts, though. Can you share the design and results
of your benchmark? Have you prototyped (or could you prototype) anything to
test it out?

Sent from Proton Mail for iOS


On Fri, Mar 22, 2024 at 14:36, Andrei Lazăr <lazarandrei...@gmail.com> wrote:

Hi Gang,

Thanks a lot for getting back to me!

The use case I have is relatively simple: I was playing around with some
data and wanted to benchmark different compression algorithms in an effort
to speed up data retrieval in a simple Parquet-based database that I am
experimenting with. Whilst doing so, I noticed a very large variance in the
performance of the same compression algorithm over different row groups in
my Parquet files. Therefore, I was thinking that the best compression
configuration for my data would be to use a different algorithm for every
column, for every row group in my files. In a real-world situation, I can
see this being used by a database, either when new entries are inserted
into it, or as a background 'optimizer' job that runs over existing data.
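
As a rough sketch, that per-row-group variance could be inspected from existing files with the parquet-cpp metadata API along these lines; the function name and the printed format are only illustrative:

#include <iostream>
#include <string>

#include <parquet/file_reader.h>
#include <parquet/metadata.h>

// Print the compression ratio of every column chunk, per row group, so the
// variance across row groups can be measured before picking codecs.
void PrintRowGroupCompressionRatios(const std::string& path) {
  auto reader = parquet::ParquetFileReader::OpenFile(path);
  auto file_meta = reader->metadata();
  for (int rg = 0; rg < file_meta->num_row_groups(); ++rg) {
    auto rg_meta = file_meta->RowGroup(rg);
    for (int col = 0; col < rg_meta->num_columns(); ++col) {
      auto chunk = rg_meta->ColumnChunk(col);
      double ratio = static_cast<double>(chunk->total_uncompressed_size()) /
                     static_cast<double>(chunk->total_compressed_size());
      std::cout << "row group " << rg << ", column " << col
                << ": compression ratio " << ratio << "\n";
    }
  }
}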

How do you feel about this?

Thank you,
Andrei

On Thu, 21 Mar 2024 at 02:11, Gang Wu <ust...@gmail.com> wrote:

Hi Andrei,

What is your use case? IMHO, exposing this kind of configuration
will force users to know how the writer will split row groups, which
does not look simple to me.

Best,
Gang

On Thu, Mar 21, 2024 at 2:25 AM Andrei Lazăr <lazarandrei...@gmail.com> wrote:

Hi all,

I would like to propose adding support for writing a Parquet file with
different compression algorithms for every row group.

In my understanding, the Parquet format allows this; however, it seems to
me that there is no way to achieve this from the C++ implementation.

Does anyone have any thoughts on this?

Thank you,
Andrei






