The threshold should probably be set as a percentage of the number of rows in 
the segment.

Also, we should implement this by setting the encoding in the table config to
"auto" or something similar. That way, it is expected that some segments have
dictionary encoding and others do not. Otherwise, the table config will be
inconsistent with what exists in (some) segments.
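
A rough sketch of how the rule discussed in this thread might look, combining
the fixed-width/variable-width split with a threshold expressed as a fraction
of the rows in the segment. The function name, the "DICTIONARY"/"RAW" labels,
and the 0.5 default are all hypothetical placeholders; the thread has not
settled on a threshold value, and this is not Pinot's actual API:

```python
def choose_encoding(cardinality, num_rows, fixed_width, threshold_ratio=0.5):
    """Pick an encoding for one column of one segment (illustrative only).

    threshold_ratio is an assumed placeholder: the fraction of the segment's
    row count below which a fixed-width column still gets a dictionary.
    """
    # Variable-width columns: always build a dictionary, as suggested in the
    # thread, unless the table config says otherwise (not modeled here).
    if not fixed_width:
        return "DICTIONARY"

    # Empty segment: nothing to profile, so keep today's default.
    if num_rows == 0:
        return "DICTIONARY"

    # Fixed-width columns: dictionary only when cardinality is below the
    # threshold, measured relative to the number of rows in the segment.
    if cardinality < threshold_ratio * num_rows:
        return "DICTIONARY"
    return "RAW"
```

With these assumed values, a fixed-width column with 900 distinct values in a
1000-row segment would come out as "RAW", while the same column in a segment
with only 100 distinct values would keep its dictionary. Since the decision is
made per segment, two segments of the same table can end up with different
encodings, which is what an "auto" setting in the table config would signal.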

-Subbu

On 2020/02/11 22:33:34, kishore g <[email protected]> wrote: 
> Any idea on what should be the threshold?
> 
> Good point about the variable-length column, should we rely on a bloom
> filter instead?
> 
> On Tue, Feb 11, 2020 at 2:14 PM Siddharth Teotia
> <[email protected]> wrote:
> 
> > Yes. Especially if the data is fixed width and high cardinality then
> > dictionary encoding is not going to be very useful.
> >
> > Maybe for fixed width, we should create a dictionary only if cardinality is
> > below a certain threshold?
> >
> > For variable width, whether cardinality is high or low, dictionary
> > encoding will improve filter processing if the column is used heavily in
> > filters. So maybe for variable width we should always create a dictionary
> > unless indicated otherwise in the table config?
> > ________________________________
> > From: kishore g <[email protected]>
> > Sent: Tuesday, February 11, 2020 2:10 PM
> > To: [email protected] <[email protected]>
> > Subject: Convert dictionary encoded into raw
> >
> > As of today, we apply dictionary encoding for all columns by default. We
> > should probably move to a hybrid approach where we decide the encoding
> > based on the data profile. For example, if the cardinality of the column
> > is very high (which is the case for metrics), dictionary encoding does not
> > provide a lot of value.
> >
> > Thoughts?
> >
> 
