Are column indexes so expensive that we don't want to use them for all
columns?

On Mon, Feb 3, 2020 at 6:41 AM Gabor Szadovszky <[email protected]> wrote:

> Dear All,
>
> After adding some new statistics and encodings to Parquet, it is getting
> very hard to be smart and choose the best configs automatically. For
> example, for which columns should we save column indexes and/or bloom
> filters? Is it worth using dictionary encoding for a column that we know
> will fall back to another encoding?
> I think we should allow users to choose the encoding for a given column,
> and whether optional statistics are saved for it. It would also help with
> testing new encodings/statistics.
>
> We already have some configuration keys, but each one sets the property
> for the whole write (e.g. parquet.enable.dictionary to enable
> dictionary encoding). I suggest extending such existing properties with a
> suffix that specifies the related column:
> parquet.enable.dictionary#column.path.col_1 or parquet.enable.dictionary#3
> (3 is the index of the column in the projection). New properties would
> follow the same format. I don't know if '#' is a valid character in the
> keys, but I guess it should be fine.
>
> What do you think?
>
> Cheers,
> Gabor
>
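
To make the proposed key format concrete, here is a minimal sketch of how a
key with a '#' suffix could be split into a base property and a column
selector. It is not existing Parquet code: the class ColumnPropertyKey and
its methods are hypothetical names, and it simply assumes that '#' is
accepted in keys, which the proposal leaves open.

```java
import java.util.OptionalInt;

// Hypothetical helper, not part of parquet-mr: splits "property#selector" keys.
final class ColumnPropertyKey {
    final String property;  // base key, e.g. "parquet.enable.dictionary"
    final String selector;  // text after the first '#', or null if the key has no suffix

    private ColumnPropertyKey(String property, String selector) {
        this.property = property;
        this.selector = selector;
    }

    // Split on the first '#'; a key without '#' applies to every column.
    static ColumnPropertyKey parse(String key) {
        int hash = key.indexOf('#');
        if (hash < 0) {
            return new ColumnPropertyKey(key, null);
        }
        return new ColumnPropertyKey(key.substring(0, hash), key.substring(hash + 1));
    }

    boolean appliesToAllColumns() {
        return selector == null;
    }

    // A purely numeric selector is read as a column index (assumed convention);
    // anything else is treated as a column path by the caller.
    OptionalInt columnIndex() {
        if (selector != null && selector.matches("\\d{1,9}")) {
            return OptionalInt.of(Integer.parseInt(selector));
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        String[] keys = {
            "parquet.enable.dictionary",
            "parquet.enable.dictionary#column.path.col_1",
            "parquet.enable.dictionary#3",
        };
        for (String key : keys) {
            ColumnPropertyKey parsed = parse(key);
            System.out.println(parsed.property + " -> selector=" + parsed.selector
                + ", index=" + parsed.columnIndex());
        }
    }
}
```

One consequence of this reading is that a column whose path consists only of
digits could not be told apart from a column index, so the two selector forms
might need distinct markers.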


-- 
Ryan Blue
Software Engineer
Netflix
