Are column indexes so expensive that we don't want to use them for all columns?

On Mon, Feb 3, 2020 at 6:41 AM Gabor Szadovszky <[email protected]> wrote:

> Dear All,
>
> After adding some new statistics and encodings into Parquet it is getting
> very hard to be smart and choose the best configs automatically. For
> example for which columns should we save column index and/or bloom-filters?
> Is it worth using dictionary for a column that we know will fall back to
> another encoding?
> I think, we shall allow the users to decide the encoding of one column or
> if some optional statistics is to be saved for another. It would also help
> testing new encodings/statistics.
>
> We already have some configuration keys but only for setting the property
> for the current writing (e.g. parquet.enable.dictionary to enable
> dictionary encoding). I suggest extending such existing properties by a
> suffix that can specify the related column:
> parquet.enable.dictionary#column.path.col_1 or parquet.enable.dictionary#3
> (3 is the index of the column in the projection). For new properties we
> would keep the same format. I don't know if '#' is a valid character in the
> keys but I guess it should be fine.
>
> What do you think?
>
> Cheers,
> Gabor
>

--
Ryan Blue
Software Engineer
Netflix
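
To make the proposed key format concrete, here is a rough sketch of how a writer might look up a property for one column. It is written in Python for brevity and is not Parquet code: the `resolve` helper, the plain-dict configuration, and the precedence (path suffix first, then index suffix, then the unsuffixed key) are all assumptions for illustration. Only the key names and the '#' separator come from the proposal, and whether '#' is valid in keys is still open.

```python
from typing import Mapping, Optional

# Separator taken from the proposal; its validity in real config keys is unconfirmed.
SEPARATOR = "#"


def resolve(
    conf: Mapping[str, str],
    key: str,
    column_path: str,
    column_index: Optional[int] = None,
    default: Optional[str] = None,
) -> Optional[str]:
    """Return the most specific value configured for a column.

    Assumed lookup order (not specified in the thread):
      1. key#<column path>
      2. key#<column index>, if an index is given
      3. key (the existing writer-wide property)
    """
    candidates = [f"{key}{SEPARATOR}{column_path}"]
    if column_index is not None:
        candidates.append(f"{key}{SEPARATOR}{column_index}")
    candidates.append(key)

    for candidate in candidates:
        if candidate in conf:
            return conf[candidate]
    return default


# Hypothetical configuration: dictionary encoding on by default,
# switched off for one column by path and for another by index.
conf = {
    "parquet.enable.dictionary": "true",
    "parquet.enable.dictionary#column.path.col_1": "false",
    "parquet.enable.dictionary#3": "false",
}
```

With this configuration, `resolve(conf, "parquet.enable.dictionary", "column.path.col_1")` picks up the path-specific override, while a column with neither a path nor an index entry falls back to the writer-wide value.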
