Dear All,

After adding several new statistics and encodings to Parquet, it is getting
very hard to choose the best configuration automatically. For example, for
which columns should we write a column index and/or bloom filters? Is it
worth using dictionary encoding for a column that we know will fall back to
another encoding?
I think we should let users decide the encoding of individual columns and
whether optional statistics are written for them. This would also help with
testing new encodings and statistics.

We already have some configuration keys, but they only set a property for
the whole write (e.g. parquet.enable.dictionary to enable dictionary
encoding). I suggest extending these existing properties with a suffix that
identifies the related column:
parquet.enable.dictionary#column.path.col_1 or parquet.enable.dictionary#3
(where 3 is the index of the column in the projection). New properties
would follow the same format. I don't know whether '#' is a valid character
in the keys, but I guess it should be fine.
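
To make the lookup concrete, here is a rough sketch of how a writer could
resolve such a key for one column. It is not existing Parquet code: the
function name resolve_column_property, the plain dict standing in for the
configuration, and the precedence (column path first, then column index,
then the existing global key) are all my assumptions for illustration.

```python
def resolve_column_property(conf, key, column_path, column_index, default=None):
    """Return the value of `key` for one column.

    Hypothetical helper, not part of Parquet. Assumed precedence:
    key#<column path>, then key#<column index>, then the plain key.
    """
    candidates = (
        f"{key}#{column_path}",   # e.g. parquet.enable.dictionary#column.path.col_1
        f"{key}#{column_index}",  # e.g. parquet.enable.dictionary#3
        key,                      # existing write-wide property
    )
    for candidate in candidates:
        if candidate in conf:
            return conf[candidate]
    return default


# Example configuration: dictionary encoding on by default,
# switched off for one column by path and for another by index.
conf = {
    "parquet.enable.dictionary": "true",
    "parquet.enable.dictionary#column.path.col_1": "false",
    "parquet.enable.dictionary#3": "false",
}

by_path = resolve_column_property(
    conf, "parquet.enable.dictionary", "column.path.col_1", 0)
by_index = resolve_column_property(
    conf, "parquet.enable.dictionary", "column.path.col_4", 3)
fallback = resolve_column_property(
    conf, "parquet.enable.dictionary", "column.path.col_2", 1)
```

With this ordering a column without a specific entry keeps today's
behaviour, so existing configurations would not change.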

What do you think?

Cheers,
Gabor
