Thanks for the support.
I've created PARQUET-1784
<https://issues.apache.org/jira/browse/PARQUET-1784> to track this one. Do
not hesitate to comment on the Jira issue.

Cheers,
Gabor

On Tue, Feb 4, 2020 at 6:34 PM Radev, Martin <[email protected]> wrote:

> Dear all,
>
>
> in our project of using Parquet for streaming floating-point (fp) data
> of varying entropy, we definitely needed to treat the columns differently.
>
> For fp data with low entropy, dictionary encoding provided good results.
> For fp data with entropy >15 bits per element, the newly added encoding +
> zstd yielded much better results in terms of compression throughput and
> ratio.
>
>
> I think that it makes sense to offer the possibility to configure each
> column differently if the developer wants to, but still keep the default
> paths intact.
>
>
> Kind regards,
>
> Martin
>
> ________________________________
> From: Manik Singla <[email protected]>
> Sent: Tuesday, February 4, 2020 6:03:14 PM
> To: Parquet Dev
> Subject: Re: Allow users to fine-tune parquet writing
>
> I think making Parquet more configurable is a nice idea.
> We had a similar requirement where we wanted different configurations
> for different columns. (I don't even remember the details now, as it
> was 2-3 months ago.)
>
> We already had some optimizations in the system for frequently queried
> columns, which exist irrespective of data format. So we thought we could
> save some money by saving on storage.
>
>
>
> Regards
> Manik Singla
> +91-9996008893
> +91-9665639677
>
> "Life doesn't consist in holding good cards but playing those you hold
> well."
>
>
> On Tue, Feb 4, 2020 at 1:53 PM Gabor Szadovszky <[email protected]> wrote:
>
> > Hi Ryan,
> >
> > I wouldn't say they are expensive. But if the column data is more or
> > less random, the column indexes would not help in filtering and would
> > add a small performance overhead. Why would we save column indexes for
> > such columns, wasting (a little) space and some time at filtering? A
> > simpler case is when the user knows that the column won't be used for
> > selective queries at all.
> > But column indexes were only one example where the user might want to
> > fine-tune the configuration. If we want to introduce new encodings, it
> > would help a lot if the user were able to select the exact encoding for
> > a column and/or switch dictionary encoding off for that column.
> > Of course, the default mechanism would remain the same as it is: we try
> > to figure out the best configuration for each column.
> >
> > What do you think about the approach of introducing the suffix described
> > above?
> >
> > Regards,
> > Gabor
> >
> > On Mon, Feb 3, 2020 at 6:26 PM Ryan Blue <[email protected]>
> > wrote:
> >
> > > Are column indexes so expensive that we don't want to use them for all
> > > columns?
> > >
> > > On Mon, Feb 3, 2020 at 6:41 AM Gabor Szadovszky <[email protected]>
> > > wrote:
> > >
> > > > Dear All,
> > > >
> > > > After adding some new statistics and encodings to Parquet, it is
> > > > getting very hard to be smart and choose the best configs
> > > > automatically. For example, for which columns should we save column
> > > > indexes and/or bloom filters? Is it worth using dictionary encoding
> > > > for a column that we know will fall back to another encoding?
> > > > I think we should allow users to decide the encoding of one column,
> > > > or whether some optional statistics are saved for another. It would
> > > > also help with testing new encodings/statistics.
> > > >
> > > > We already have some configuration keys, but they only set the
> > > > property for the whole write (e.g. parquet.enable.dictionary to
> > > > enable dictionary encoding). I suggest extending such existing
> > > > properties with a suffix that specifies the related column:
> > > > parquet.enable.dictionary#column.path.col_1 or
> > > > parquet.enable.dictionary#3 (3 is the index of the column in the
> > > > projection). For new properties we would keep the same format. I
> > > > don't know if '#' is a valid character in the keys, but I guess it
> > > > should be fine.
> > > >
> > > > What do you think?
> > > >
> > > > Cheers,
> > > > Gabor
> > > >
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> > >
> >
>
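
For illustration, the per-column suffix proposed in the original message
(parquet.enable.dictionary#column.path.col_1 or parquet.enable.dictionary#3)
could be resolved roughly as in the Python sketch below. This is not
parquet-mr code. The function name, the lookup order (column path, then
column index, then the existing writer-wide key) and the plain dict standing
in for the real configuration object are all assumptions made for the
example. The thread itself leaves these details open, including whether '#'
is usable as a separator.

```python
from typing import Mapping, Optional

# Assumption: '#' as separator, as tentatively suggested in the thread.
SEPARATOR = "#"


def resolve_column_property(
    conf: Mapping[str, str],
    key: str,
    column_path: str,
    column_index: Optional[int] = None,
    default: Optional[str] = None,
) -> Optional[str]:
    """Return the most specific value of `key` for one column.

    Hypothetical lookup order (not specified in the thread):
      1. key#<column path>
      2. key#<column index>, if an index is given
      3. key (the existing writer-wide property)
      4. default
    """
    candidates = [f"{key}{SEPARATOR}{column_path}"]
    if column_index is not None:
        candidates.append(f"{key}{SEPARATOR}{column_index}")
    candidates.append(key)

    for candidate in candidates:
        if candidate in conf:
            return conf[candidate]
    return default


# Usage: dictionary encoding stays on globally, but is off for one column.
conf = {
    "parquet.enable.dictionary": "true",
    "parquet.enable.dictionary#column.path.col_1": "false",
}
col_1 = resolve_column_property(
    conf, "parquet.enable.dictionary", "column.path.col_1"
)  # "false": the column-specific key wins
other = resolve_column_property(
    conf, "parquet.enable.dictionary", "column.path.col_2"
)  # "true": falls back to the writer-wide key
```

Falling back to the unsuffixed key keeps the default behaviour unchanged for
every column that has no override, which matches the intent stated in the
thread that the default mechanism should remain as it is.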
