Dear All,

After adding some new statistics and encodings to Parquet, it is getting very hard to be smart and choose the best configuration automatically. For example, for which columns should we save column indexes and/or bloom filters? Is it worth using dictionary encoding for a column that we know will fall back to another encoding? I think we should allow users to decide the encoding of one column, or whether some optional statistics are saved for another. It would also help with testing new encodings and statistics.

We already have some configuration keys, but they only set a property for the current write as a whole (e.g. parquet.enable.dictionary to enable dictionary encoding). I suggest extending such existing properties with a suffix that specifies the related column: parquet.enable.dictionary#column.path.col_1 or parquet.enable.dictionary#3 (where 3 is the index of the column in the projection). New properties would follow the same format. I don't know if '#' is a valid character in the keys, but I guess it should be fine.

What do you think?

Cheers,
Gabor
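P.S. To make the proposed key format concrete, here is a rough sketch of how a lookup could work. It is only an illustration, not existing Parquet code: the class ColumnPropertySketch, the helpers columnKey and getBoolean, and the rule that a column-specific key overrides the global one (with a fallback to the global key when it is absent) are all assumptions, and a plain Map stands in for the real configuration object. Only the path form of the suffix is handled here, not the numeric index form.

```java
import java.util.HashMap;
import java.util.Map;

public class ColumnPropertySketch {
    // Separator taken from the proposal; whether '#' is valid in real keys is still open.
    static final char SEPARATOR = '#';

    // Builds a column-specific key, e.g. parquet.enable.dictionary#column.path.col_1
    static String columnKey(String baseKey, String columnPath) {
        return baseKey + SEPARATOR + columnPath;
    }

    // Assumed lookup order: column-specific key, then the global key, then the default.
    static boolean getBoolean(Map<String, String> conf, String baseKey,
                              String columnPath, boolean defaultValue) {
        String value = conf.get(columnKey(baseKey, columnPath));
        if (value == null) {
            value = conf.get(baseKey);
        }
        return value == null ? defaultValue : Boolean.parseBoolean(value);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("parquet.enable.dictionary", "true");
        conf.put("parquet.enable.dictionary#column.path.col_1", "false");

        // col_1 has its own setting; col_2 falls back to the global key.
        System.out.println(getBoolean(conf, "parquet.enable.dictionary", "column.path.col_1", true));
        System.out.println(getBoolean(conf, "parquet.enable.dictionary", "column.path.col_2", true));
    }
}
```

With this lookup order, existing jobs that only set the global key would keep their current behaviour, and a per-column key would be a pure override.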
