Hi Gang,

Thanks a lot for getting back to me!

My use case is relatively simple: I was experimenting with some data and
wanted to benchmark different compression algorithms in an effort to speed
up data retrieval in a simple Parquet-based database I am building. Whilst
doing so, I noticed a very large variance in how the same compression
algorithm performs across different row groups in my Parquet files.
Therefore, I think the best compression configuration for my data would be
to use a different algorithm for every column of every row group. In a
real-world setting, I can see a database using this either when new entries
are inserted or as a background 'optimizer' job that runs over existing
data.
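
To make the gap concrete, here is a minimal sketch of how compression is
configured with parquet-cpp today, as I understand it (the column names and
codec choices are just placeholders for illustration). The codecs are set
once on the writer properties, so they apply to every row group in the
file:

  #include <memory>
  #include <parquet/properties.h>

  std::shared_ptr<parquet::WriterProperties> MakeProps() {
    parquet::WriterProperties::Builder builder;
    builder.compression(parquet::Compression::ZSTD)            // file-wide default
        ->compression("sensor_id", parquet::Compression::LZ4)  // per-column override
        ->compression("payload", parquet::Compression::GZIP);
    // These properties are passed once when the file writer is opened, so
    // the chosen codecs apply to every row group in the file -- there is
    // currently no hook to change them between row groups.
    return builder.build();
  }

What I am proposing is some way to supply (or override) these per-column
codec choices on a per-row-group basis as well.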

How do you feel about this?

Thank you,
Andrei

On Thu, 21 Mar 2024 at 02:11, Gang Wu <ust...@gmail.com> wrote:

> Hi Andrei,
>
> What is your use case? IMHO, exposing this kind of configuration
> will force users to know how the writer will split row groups, which
> does not look simple to me.
>
> Best,
> Gang
>
> On Thu, Mar 21, 2024 at 2:25 AM Andrei Lazăr <lazarandrei...@gmail.com>
> wrote:
>
> > Hi all,
> >
> > I would like to propose adding support for writing a Parquet file with
> > a different compression algorithm for every row group.
> >
> > In my understanding, the Parquet format allows this; however, it seems
> > to me that there is no way to achieve this from the C++ implementation.
> >
> > Does anyone have any thoughts on this?
> >
> > Thank you,
> > Andrei
> >
>