> 1. It has to support range operations, not every type supports, especially
> for User Defined Types that are used in IN constant lists, they might only
> support equal operation.
If types are not ordered you can still create lists of points.
We’d just create a Comparable wrapper that uses the natural order of the UDT.
As you know, sargs are very closely related to b-tree indexes; a sarg is
basically the definition of a b-tree index scan. Do databases prevent you from
building indexes on non-ordered types? Of course not.
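To make the wrapper idea concrete, here is a minimal sketch in plain Java. Everything in it is hypothetical and for illustration only: Color stands in for a UDT that only supports equality, OrderedUdt is the Comparable wrapper, and the ordering (by name) is an arbitrary total order chosen to be consistent with equals. It is not Calcite code.

```java
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical UDT that defines equality but no ordering of its own.
final class Color {
  final String name;
  Color(String name) { this.name = name; }
  @Override public boolean equals(Object o) {
    return o instanceof Color && ((Color) o).name.equals(name);
  }
  @Override public int hashCode() { return name.hashCode(); }
}

// Hypothetical wrapper that imposes a total order on values of type T,
// so that they can be used as the endpoints of point ranges.
final class OrderedUdt<T> implements Comparable<OrderedUdt<T>> {
  final T value;
  private final Comparator<? super T> order;
  OrderedUdt(T value, Comparator<? super T> order) {
    this.value = value;
    this.order = order;
  }
  @Override public int compareTo(OrderedUdt<T> that) {
    return order.compare(value, that.value);
  }
  @Override public boolean equals(Object o) {
    return o instanceof OrderedUdt && ((OrderedUdt<?>) o).value.equals(value);
  }
  @Override public int hashCode() { return value.hashCode(); }
}

final class UdtPointsSketch {
  // Assumption: any total order consistent with equals() will do.
  static final Comparator<Color> BY_NAME = Comparator.comparing(c -> c.name);

  // Builds a sorted, de-duplicated set of points from an IN list.
  static TreeSet<OrderedUdt<Color>> points(List<Color> values) {
    TreeSet<OrderedUdt<Color>> set = new TreeSet<>();
    for (Color c : values) {
      set.add(new OrderedUdt<>(c, BY_NAME));
    }
    return set;
  }

  // Membership test, equivalent to "c IN (values)".
  static boolean contains(TreeSet<OrderedUdt<Color>> points, Color c) {
    return points.contains(new OrderedUdt<>(c, BY_NAME));
  }
}
```

The order does not need to mean anything to the user; it only needs to be total and stable so that the points can be sorted and searched.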
> 2. The range optimization of ">2 AND <4" to "3” is only valid for
> integer-like types, but I don't think it will bring much gains. Optimizing
> col in (1,2,3,4,5) to a >=1 and a <=5 is ok, but only ok for a single or very
> limited number of disjoint ranges, but this is a very limited corner case,
> most likely we may end up with many disjoint ranges, especially when there
> are 10k values. I don't think it is worth doing SARG for IN for this sake.
What is the cost that you are worried about? I believe that a Guava
ImmutableRangeSet containing 10k point ranges will use less memory than a
RexCall(IN, RexInputRef, RexLiteral, …, RexLiteral).
The range set can be quickly converted back to another format (say
RexCall(OR, ...)) when you need it.
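As a rough illustration of how cheap the round trip is, the following sketch coalesces a list of integer points into closed ranges and renders them back as an OR of predicates. It is plain Java over long values, not Calcite's or Guava's API; PointRangeSketch, Range, coalesce and toPredicate are hypothetical names, and merging adjacent points is only valid for integer-like types.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

final class PointRangeSketch {
  // Closed integer range [lo, hi]; a point is a range with lo == hi.
  static final class Range {
    final long lo;
    final long hi;
    Range(long lo, long hi) { this.lo = lo; this.hi = hi; }
  }

  // Sorts and de-duplicates the points, then merges runs of consecutive
  // integers into a single closed range.
  static List<Range> coalesce(Collection<Long> points) {
    List<Range> ranges = new ArrayList<>();
    boolean open = false;
    long lo = 0;
    long hi = 0;
    for (long p : new TreeSet<>(points)) {
      if (open && p == hi + 1) {
        hi = p;  // extends the current run
      } else {
        if (open) {
          ranges.add(new Range(lo, hi));
        }
        lo = p;
        hi = p;
        open = true;
      }
    }
    if (open) {
      ranges.add(new Range(lo, hi));
    }
    return ranges;
  }

  // Renders the ranges as a SQL-like predicate on the given column.
  static String toPredicate(String col, List<Range> ranges) {
    List<String> terms = new ArrayList<>();
    for (Range r : ranges) {
      terms.add(r.lo == r.hi
          ? col + " = " + r.lo
          : "(" + col + " >= " + r.lo + " AND " + col + " <= " + r.hi + ")");
    }
    return String.join(" OR ", terms);
  }
}
```

For the points 1, 2, 3, 4, 5 and 9 on column a, this yields "(a >= 1 AND a <= 5) OR a = 9". For non-integer types the coalescing step would be skipped and every point would remain its own range.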
> 3. For non-integer data types, like double or string, we will end up with
> ranges {[a,a], [b,b], [c,c]...}, the stats derivation e.g. inferring
> selectivity from histogram may take much longer time depends on the
> implementation.
Histograms are usually based on ranges, not points. So I’d expect sargs to be
a better representation for deriving selectivity, not a worse one.
> 4. It is not extensible. It can only be used for iN or NOT IN. What about
> customized operators like geospatial intersect? e.g. col intersect ANY(area1,
> area2, area3)
Sure, the approach has its limits. But it goes quite a lot further than IN,
which is just a list of points, at very low cost.
Domains whose data types are not totally ordered tend to have their own
operators already. For your example I’d write ST_Intersects(col,
ST_Collect(area1, area2, area3)), and that would be a perfectly suitable
representation in RexNode-land.
Julian