Dear all,

In our project, which uses Parquet for streaming floating-point (fp) data of
varying entropy, we definitely needed to treat the columns differently.

For fp data with low entropy, dictionary encoding provided good results. For fp
data with entropy >15 bits per element, the newly added encoding + zstd yielded
much better results in terms of compression throughput and ratio.
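
To make the entropy criterion concrete, here is a minimal sketch of how the
bits-per-element figure could be estimated for a column. It is not code from
our project: the function names and the returned labels are hypothetical, and
treating everything at or below 15 bits as "low entropy" is a simplifying
assumption (only the >15 bits threshold comes from our measurements).

```python
import math
import struct
from collections import Counter


def entropy_bits_per_element(values):
    """Empirical Shannon entropy of a sequence of doubles, in bits per element."""
    n = len(values)
    if n == 0:
        return 0.0
    # Count exact 64-bit patterns so that 0.0 and -0.0 are kept apart.
    counts = Counter(struct.pack("<d", v) for v in values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def suggest_encoding(values, threshold_bits=15.0):
    """Hypothetical helper: pick a label based on the entropy threshold."""
    # Assumption: anything at or below the threshold counts as "low entropy".
    if entropy_bits_per_element(values) > threshold_bits:
        return "new_encoding_plus_zstd"
    return "dictionary"
```

Note that the estimate can never exceed log2 of the number of elements, so a
small sample will always look like low-entropy data.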


I think that it makes sense to offer the possibility to configure each column 
differently if the developer wants to, while still keeping the default paths
intact.
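
As an illustration only, the '#' suffix proposed in the quoted message below
could be resolved roughly as follows. This is a sketch and not an existing
Parquet API: the function name is hypothetical, and the lookup order (column
path first, then column index, then the unsuffixed key) is my assumption.

```python
def resolve_property(props, key, column_path, column_index, default=None):
    """Return the value of `key` for one column, falling back to the global key.

    Hypothetical lookup order (an assumption, not part of the proposal):
      1. key#<column path>
      2. key#<column index>
      3. key (the existing, writer-wide setting)
    """
    candidates = (f"{key}#{column_path}", f"{key}#{column_index}", key)
    for candidate in candidates:
        if candidate in props:
            return props[candidate]
    return default


# Example configuration using the key format from the proposal.
props = {
    "parquet.enable.dictionary": "true",
    "parquet.enable.dictionary#column.path.col_1": "false",
    "parquet.enable.dictionary#3": "false",
}
```

Columns without a suffixed entry still get the writer-wide value, so the
default behaviour stays unchanged unless someone opts in per column.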


Kind regards,

Martin

________________________________
From: Manik Singla <[email protected]>
Sent: Tuesday, February 4, 2020 6:03:14 PM
To: Parquet Dev
Subject: Re: Allow users to fine-tune parquet writing

I think making parquet more configurable is a nice idea.
We had a similar requirement where we wanted to have different
configurations for different columns. (I don't even remember the details now,
as it has been 2-3 months.)

We already had some optimizations in the system for frequently queried
columns, which exist irrespective of the data format. So, we thought to save
some money by saving on storage.



Regards
Manik Singla
+91-9996008893
+91-9665639677

"Life doesn't consist in holding good cards but playing those you hold
well."


On Tue, Feb 4, 2020 at 1:53 PM Gabor Szadovszky <[email protected]> wrote:

> Hi Ryan,
>
> I wouldn't say they are expensive. But if the column data is more or less
> random, the column indexes would not help in filtering but would add a
> small performance overhead. Why would we save column indexes for such
> columns, wasting (a little) space and some time at filtering? A
> simpler case is when the user knows that the related column won't be
> used for selective queries at all.
> But column indexes were only an example where the user might want to
> fine-tune the configuration. If we want to introduce new encodings, it would
> help a lot if the user were able to select the exact encoding for a
> column and/or switch the dictionary encoding off for that column.
> Of course, the default mechanism would remain the same as it is: we try to
> figure out the best configuration for each column.
>
> What do you think about the approach of introducing the suffix described
> above?
>
> Regards,
> Gabor
>
> On Mon, Feb 3, 2020 at 6:26 PM Ryan Blue <[email protected]>
> wrote:
>
> > Are column indexes so expensive that we don't want to use them for all
> > columns?
> >
> > On Mon, Feb 3, 2020 at 6:41 AM Gabor Szadovszky <[email protected]>
> > wrote:
> >
> > > Dear All,
> > >
> > > After adding some new statistics and encodings into Parquet, it is
> > > getting very hard to be smart and choose the best configs
> > > automatically. For example, for which columns should we save column
> > > indexes and/or bloom filters? Is it worth using a dictionary for a
> > > column that we know will fall back to another encoding?
> > > I think we should allow users to decide the encoding of one column or
> > > whether some optional statistics are to be saved for another. It would
> > > also help in testing new encodings/statistics.
> > >
> > > We already have some configuration keys, but only for setting the
> > > property for the current write as a whole (e.g.
> > > parquet.enable.dictionary to enable dictionary encoding). I suggest
> > > extending such existing properties with a suffix that specifies the
> > > related column:
> > > parquet.enable.dictionary#column.path.col_1 or
> > > parquet.enable.dictionary#3
> > > (3 is the index of the column in the projection). For new properties we
> > > would keep the same format. I don't know if '#' is a valid character in
> > > the keys, but I guess it should be fine.
> > >
> > > What do you think?
> > >
> > > Cheers,
> > > Gabor
> > >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
>
