Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

Julian Hyde Thu, 13 Aug 2020 12:48:55 -0700

I have logged https://issues.apache.org/jira/browse/CALCITE-4173 
<https://issues.apache.org/jira/browse/CALCITE-4173> and am close to a 
prototype.

See comments inline, below.

> On Aug 11, 2020, at 3:39 PM, Haisheng Yuan <[email protected]> wrote:
> 
>> I am proposing to use sargs only where we would today use RexCall(IN). The 
>> data structures have about the same size. Sargs are sorted. I cannot see any 
>> cases that would become more expensive.
> 
> IIRC, RexCall(IN) is not supported yet.

It’s not officially supported. But it is there - added in 
https://issues.apache.org/jira/browse/CALCITE-2444 
<https://issues.apache.org/jira/browse/CALCITE-2444> for just Rel-to-Sql part 
of the process, and people seem to be using it elsewhere, e.g. 
https://issues.apache.org/jira/browse/CALCITE-1413 
<https://issues.apache.org/jira/browse/CALCITE-1413> handles IN in CASE.

I have not used RexCall.IN personally. I think there’s too much cost (in making 
all places aware of the IN operator) and not enough benefit (because it cannot 
handle ranges or <>).

One part of the cost is null-semantics. If the values are not literals, one of 
them might evaluate to NULL. So we have to be very conservative, especially 
when optimizing “NOT IN (list)”. There are no tests for that simplification, so 
we’re probably doing it wrong.

So, I propose to remove IN.

> With sargs, you have to sort it, no matter explicitly or implicitly, in case 
> 20k values, it will take some time.

I am not too concerned about the cost of sorting. If you have parsed and 
validated an IN list with 20k items, the query has already burned a few CPU 
cycles.

The Sarg data structure is immutable and one instance should suffice for the 
whole planning process, so the cost of sorting on creation is more than repaid 
by the faster access later on.

> I am not saying the data structure itself is expensive, but how it is used, 
> it all depends on how downstream projects derive stats from it. I have seen 
> Greenplum optimizer experienced performance issue for large IN list when 
> deriving all the stats info from the thousands of disjoint range intervals. 
> Some downstream project may not even use histogram. Some may just use the 
> number of IN constants to derive the number of distinct values, I would 
> imagine people have to construct the Sarg back to constant list, this is an 
> extra burden. 

If there are performance issues, we can add extra operations to the Sarg data 
structure without changing its size/cost. For example, record whether all the 
ranges are points. And if so, efficiently iterate over all the values.

> Suppose we have col in (1,2,4,6,8,10, 12,13, 15,17,19, 21,22, 24, 24+2.....) 
> 10k values, now the range sets will be {[1,2], [4,4], ....[10,10], [12,13], 
> [15,15]....[21,22], [24,24]....}, in the execution engine, what if I want to 
> do the look up for this expression? Find out the points, extract ranges and 
> disjoint them, or infer points from integer ranges?

You should transcribe the Sarg into a data structure that makes sense for your 
execution engine. Say a bloom filter, a compressed bitmap, or a perfect hash.

> IMHO, the interval constraint is more like a kind of logical property and can 
> be strategy for simplification, people can derive the interval constraints 
> from the IN list, do simplification on it (but I doubt the value in practice, 
> which usually brings more trouble than benefits), but we don't have to 
> represent it with sarg in RexNode world.

I agree that it should be a logical property. It should be part of 
RelOptPredicateList. But to apply those predicates efficiently, we need one 
predicate that searches in a range rather than 20k predicates.

> I am not asking sargs to support geospatial operations. And geospatial 
> intersect is just an example of potential customized operation, which are not 
> limited to geospatials. It can be some other binary operations. We can't 
> require all the customized operation to have something like ST_Collect. 

Sure, there are other techniques. One of which is generating rasters (e.g. if 
you are doing a spatial join of states to national parks, pre-generate for each 
national park a bitmap of the all 100x100km squares, and put a 1 in the square 
if it overlaps the park). Rasters are kind of similar to sargs.

I have been working on space-filling curves as indexes for point-to-shape 
joins, as part of https://issues.apache.org/jira/browse/CALCITE-1861 
<https://issues.apache.org/jira/browse/CALCITE-1861>. In 
https://github.com/julianhyde/calcite/blob/829c33a524adb727482df27d605d193a0c7a280c/core/src/test/resources/org/apache/calcite/test/RelOptRulesTest.xml#L7966

<https://github.com/julianhyde/calcite/blob/829c33a524adb727482df27d605d193a0c7a280c/core/src/test/resources/org/apache/calcite/test/RelOptRulesTest.xml#L7966>
 see how we transform a call to ST_DWithin to the predicate 

LogicalFilter(condition=[AND(OR(AND(>=($4, 36496), <=($4, 36520)), AND(>=($4, 
36456), <=($4, 36464)), AND(>=($4, 33252), <=($4, 33254)), AND(>=($4, 33236), 
<=($4, 33244)), AND(>=($4, 33164), <=($4, 33176)), AND(>=($4, 33112), <=($4, 
33156)), AND(>=($4, 33092), <=($4, 33100)), AND(>=($4, 33055), <=($4, 33080)), 
AND(>=($4, 33050), <=($4, 33053)), AND(>=($4, 33033), <=($4, 33035))), 
ST_DWITHIN(POINT (10 20):GEOMETRY, ST_POINT($1, $2), 6))])

The first part of the condition is precisely a Sarg - a set of integer ranges. 
If represented an a set of points, it would be much larger.

> Oracle support this:
> select * from foo where a > ANY(1, 2, 3, ....., 999);
> select * from foo where a <= ANY(b+1, b+2, b+3, ....., 999);
> Although in reality that kind of query might be rare, how would Calcite 
> represent in Oracle dialect? And would it be good to support operations other 
> than equal, like <, >=, or even customized operation? Especially when 
> downstream projects may use calcite to build their own in-house system, some 
> are not sql standard.

I think of ANY and ALL - especially on constant lists, as opposed to queries - 
as syntactic sugar.

The first one can be represented as a Sarg:

  SEARCH(a, SARG((1, inf), (2, inf), (3, inf), … (999, inf))

And the Sarg would optimize itself, on construction, to a single range, 
SARG((1, inf)). This illustrates the power of Sargs - they are simple 
mathematical objects, they are in a canonical form, and there are only a few 
operations. It would take thought to optimize

  a > ANY (1, 2, 3, …, 9)

to

  a > 1

but Sarg does it automatically.

The second query cannot be represented as a Sarg, because it is not a constant.

> BTW, what is your concern about the proposal of RexListCmp [1]?

RexListCmp is certainly moving in the same direction as Sarg - a generalized 
IN-list. But it is less elegant - it is not clear to me what are the 
fundamental operations on a RexListCmp.

The ability to choose your operator adds complexity but little power. The 
“generalized IN lists” that I see in plans are better modeled by a mixture of 
operators - "x > 3 AND x < 10 OR x = 100 OR x > 1000” than having one operator 
throughout the list.

I was not aware of Oracle's “expression relop (ANY | ALL) (list)” syntax and I 
don’t think it’s very common. It can be expanded at compile time to an OR, and 
that OR can be converted to a Sarg if list consists only of constant 
expressions. I think that is a good solution to the common case.

Julian

Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

Reply via email to