Hi Ryan,

I wouldn't say they are expensive. But if the column data is more or less
random, the column indexes would not help with filtering and would still add
a small performance overhead. Why would we save column indexes for such
columns, wasting a little space and some time at filtering? A simpler
situation is when the user knows that the column won't be used for
selective queries at all.
But column indexes were only an example where the user might want to
fine-tune the configuration. If we want to introduce new encodings, it would
help a lot if the user were able to select the exact encoding for a
column and/or switch dictionary encoding off for that column.
Of course, the default mechanism would remain the same as it is: we try to
figure out the best configuration for each column.
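
To make the idea a bit more concrete, here is a minimal sketch of how a
per-column lookup could fall back to the existing global key and then to the
default. It uses a plain Map in place of the real configuration object, and
the class and method names (ColumnConfigSketch, getBoolean) are hypothetical;
this is not an existing Parquet API, only an illustration of the proposed
'#' suffix.

```java
import java.util.HashMap;
import java.util.Map;

public class ColumnConfigSketch {

    // Hypothetical helper: tries "<key>#<columnPath>" first, then the plain
    // key, and finally falls back to the supplied default.
    public static boolean getBoolean(Map<String, String> conf, String key,
                                     String columnPath, boolean defaultValue) {
        String value = conf.get(key + "#" + columnPath);
        if (value == null) {
            value = conf.get(key);
        }
        return value == null ? defaultValue : Boolean.parseBoolean(value);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Global setting, as it works today.
        conf.put("parquet.enable.dictionary", "true");
        // Proposed per-column override using the '#' suffix.
        conf.put("parquet.enable.dictionary#column.path.col_1", "false");

        // Column with an override: prints false
        System.out.println(getBoolean(conf, "parquet.enable.dictionary",
                "column.path.col_1", true));
        // Column without an override uses the global value: prints true
        System.out.println(getBoolean(conf, "parquet.enable.dictionary",
                "column.path.col_2", true));
    }
}
```

With this kind of lookup, existing jobs that set only the global key would
behave exactly as before.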

What do you think about the approach of introducing the suffix described
above?

Regards,
Gabor

On Mon, Feb 3, 2020 at 6:26 PM Ryan Blue <[email protected]> wrote:

> Are column indexes so expensive that we don't want to use them for all
> columns?
>
> On Mon, Feb 3, 2020 at 6:41 AM Gabor Szadovszky <[email protected]> wrote:
>
> > Dear All,
> >
> > After adding some new statistics and encodings into Parquet it is getting
> > very hard to be smart and choose the best configs automatically. For
> > example for which columns should we save column index and/or
> > bloom-filters?
> > Is it worth using dictionary for a column that we know will fall back to
> > another encoding?
> > I think, we shall allow the users to decide the encoding of one column or
> > if some optional statistics is to be saved for another. It would also
> > help testing new encodings/statistics.
> >
> > We already have some configuration keys but only for setting the property
> > for the current writing (e.g. parquet.enable.dictionary to enable
> > dictionary encoding). I suggest extending such existing properties by a
> > suffix that can specify the related column:
> > parquet.enable.dictionary#column.path.col_1 or
> > parquet.enable.dictionary#3
> > (3 is the index of the column in the projection). For new properties we
> > would keep the same format. I don't know if '#' is a valid character in
> > the keys but I guess it should be fine.
> >
> > What do you think?
> >
> > Cheers,
> > Gabor
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
