Any idea what the threshold should be? Good point about variable-length columns. Should we rely on a bloom filter instead?
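
To make the discussion concrete, here is a rough sketch of the heuristic proposed in the quoted message below. It is only an illustration, not existing Pinot code: `ColumnProfile`, `choose_encoding`, the `force_raw` override, and the threshold value are all hypothetical, and expressing the threshold as a ratio of distinct values to total values (rather than an absolute count) is my own assumption.

```python
from dataclasses import dataclass

# Hypothetical placeholder: the thread leaves the actual threshold open.
# Assumed here to be a ratio of distinct values to total values.
CARDINALITY_RATIO_THRESHOLD = 0.5


@dataclass
class ColumnProfile:
    """Illustrative per-column data profile (not a real Pinot class)."""
    name: str
    fixed_width: bool
    cardinality: int          # number of distinct values
    num_docs: int             # total number of values in the segment
    force_raw: bool = False   # stand-in for an explicit override in table config


def choose_encoding(col: ColumnProfile) -> str:
    """Return "DICTIONARY" or "RAW" following the heuristic in the thread."""
    if col.force_raw:
        # Explicit override wins over any data-driven decision.
        return "RAW"
    if not col.fixed_width:
        # Variable width: always create a dictionary unless overridden.
        return "DICTIONARY"
    if col.num_docs == 0:
        # No data to profile: fall back to today's default.
        return "DICTIONARY"
    # Fixed width: create a dictionary only below the cardinality threshold.
    ratio = col.cardinality / col.num_docs
    return "DICTIONARY" if ratio < CARDINALITY_RATIO_THRESHOLD else "RAW"
```

For example, a fixed-width metric with 950 distinct values in 1000 rows would come out as raw, while a variable-width column with the same profile would still get a dictionary. Whether the bloom filter question changes the variable-width branch is exactly what we would need to decide.
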

On Tue, Feb 11, 2020 at 2:14 PM Siddharth Teotia <[email protected]> wrote:

> Yes. Especially if the data is fixed width and high cardinality, then
> dictionary encoding is not going to be very useful.
>
> Maybe for fixed width, we should create a dictionary only if cardinality is
> below a certain threshold?
>
> For variable width, whether cardinality is high or low, dictionary
> encoding will improve filter processing if the column is used heavily in
> filters. So maybe for variable width we should always create a dictionary
> unless indicated otherwise in the table config?
> ________________________________
> From: kishore g <[email protected]>
> Sent: Tuesday, February 11, 2020 2:10 PM
> To: [email protected] <[email protected]>
> Subject: Convert dictionary encoded into raw
>
> As of today, we apply dictionary encoding for all columns by default. We
> should probably move to a hybrid approach where we decide the encoding based
> on the data profile. E.g., if the cardinality of the column is very high
> (which is the case for metrics), dictionary encoding does not provide a lot
> of value.
>
> Thoughts?
