> 1. It has to support range operations, not every type supports, especially 
> for User Defined Types that are used in IN constant lists, they might only 
> support equal operation.

If types are not ordered you can still create lists of points.

We’d just create a Comparable wrapper that uses the natural order of the UDT.
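A minimal sketch of what such a wrapper might look like, using only the JDK. The names `OrderedValue`, `Tag` and `SargPointsDemo` are hypothetical and not part of Calcite, and the comparator stands in for whatever consistent order the UDT can supply:

```java
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical wrapper: makes any value Comparable via a supplied order. */
final class OrderedValue<T> implements Comparable<OrderedValue<T>> {
  final T value;
  private final Comparator<? super T> order;

  OrderedValue(T value, Comparator<? super T> order) {
    this.value = value;
    this.order = order;
  }

  @Override public int compareTo(OrderedValue<T> that) {
    return order.compare(this.value, that.value);
  }

  @Override public boolean equals(Object o) {
    return o instanceof OrderedValue
        && value.equals(((OrderedValue<?>) o).value);
  }

  @Override public int hashCode() {
    return value.hashCode();
  }
}

/** Hypothetical UDT that only supports equality. */
final class Tag {
  final String name;

  Tag(String name) {
    this.name = name;
  }

  @Override public boolean equals(Object o) {
    return o instanceof Tag && name.equals(((Tag) o).name);
  }

  @Override public int hashCode() {
    return name.hashCode();
  }
}

final class SargPointsDemo {
  /** Builds a sorted, de-duplicated set of points from IN-list constants. */
  static <T> TreeSet<OrderedValue<T>> points(List<T> values,
      Comparator<? super T> order) {
    TreeSet<OrderedValue<T>> set = new TreeSet<>();
    for (T v : values) {
      set.add(new OrderedValue<>(v, order));
    }
    return set;
  }
}
```

Each element of the resulting set is a single point, so the IN list is just a sorted list of points, even though the underlying type has no range operators of its own.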

As you know, sargs are very closely related to b-tree indexes; a sarg is basically 
the definition of a b-tree index scan. Do databases prevent you from building 
indexes on non-ordered types? Of course not.

> 2. The range optimization of ">2 AND <4" to "3" is only valid for 
> integer-like types, but I don't think it will bring much gains. Optimizing  
> col in (1,2,3,4,5) to a >=1 and a <=5 is ok, but only ok for a single or very 
> limited number of disjoint ranges, but this is a very limited corner case, 
> most likely we may end up with many disjoint ranges, especially when there 
> are 10k values. I don't think it is worth doing SARG for IN for this sake.

What is the cost that you are worried about? I believe that a Guava 
ImmutableRangeSet containing 10k point ranges will use less memory than a 
RexCall(IN, RexInputRef, RexLiteral, …, RexLiteral).

It can be quickly converted back to another format (say RexCall(OR, ...)) when 
you need it.
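To illustrate the round trip, here is a rough sketch that uses plain JDK collections rather than Guava or Calcite. The class `PointRanges` and its methods are hypothetical, integer points stand in for literals, and a SQL-like string stands in for the RexCall(OR, ...) form:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

final class PointRanges {
  /** Sorts the points and merges runs of consecutive integers
   * into closed ranges, each stored as {lo, hi}. */
  static List<int[]> coalesce(Collection<Integer> points) {
    List<int[]> ranges = new ArrayList<>();
    for (int p : new TreeSet<>(points)) {
      int[] last = ranges.isEmpty() ? null : ranges.get(ranges.size() - 1);
      if (last != null && (long) last[1] + 1 == p) {
        last[1] = p; // extend the current run
      } else {
        ranges.add(new int[] {p, p}); // start a new point range
      }
    }
    return ranges;
  }

  /** Converts the ranges back to an OR of comparisons on the column. */
  static String toPredicate(String column, List<int[]> ranges) {
    List<String> terms = new ArrayList<>();
    for (int[] r : ranges) {
      terms.add(r[0] == r[1]
          ? column + " = " + r[0]
          : "(" + column + " >= " + r[0]
              + " AND " + column + " <= " + r[1] + ")");
    }
    return String.join(" OR ", terms);
  }
}
```

With the points 5, 1, 3, 2, 4, 9 this yields the two ranges [1, 5] and [9, 9], which convert back to `(a >= 1 AND a <= 5) OR a = 9`. Both directions are a single pass over the sorted points.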

> 3. For non-integer data types, like double or string, we will end up with 
> ranges {[a,a], [b,b], [c,c]...}, the stats derivation e.g. inferring 
> selectivity from histogram may take much longer time depends on the 
> implementation.

Histograms are usually based on ranges, not points. So I’d expect sargs to be 
a better representation, not a worse one.

> 4. It is not extensible. It can only be used for IN or NOT IN. What about 
> customized operators like geospatial intersect? e.g. col intersect ANY(area1, 
> area2, area3)

Sure, the approach has its limits. But it goes quite a lot further than IN, 
which is just a list of points, at very low cost.

Areas that have not-totally-ordered data types tend to have their own operators 
already. For your example I’d write ST_Intersects(col, ST_Collect(area1, area2, 
area3)), and that would be a perfectly suitable representation in RexNode-land.

Julian
