Hi, Julian.

I would like to review it.


Best,
Chunwei


On Sat, Aug 29, 2020 at 5:19 AM Julian Hyde <[email protected]> wrote:

> I now have a PR for this change. Can some people review?
>
> https://github.com/apache/calcite/pull/2124 <
> https://github.com/apache/calcite/pull/2124>
>
> There are legitimate concerns that we are adding a new RexCall operator
> (SEARCH) and there is a considerable effort to support it everywhere. But
> we are removing two RexCall operators (IN-list and BETWEEN), because SEARCH
> subsumes them.
>
> The benefit is that we can do ‘range algebra’ simplifications without
> resorting to error-prone Boolean logic simplifications as often as we do
> today.
>
> I have added utilities to introduce SEARCH into SEARCH-less RexNode
> expressions, and to remove SEARCH from RexNode expressions, and to convert
> RexNode expressions that contain SEARCH directly into SqlNode expressions.
>
> In several places, I just expanded SEARCH, but it would be better to
> handle SEARCH explicitly, as if it were a comparison operator (along with
> =, <, <= etc.) It might be worth revisiting the Mongo, Geode and Druid
> adapters. Geode has an ‘IS SET’ construct, very similar to SQL ‘IN-list’,
> that would easier to recognize on SEARCH than on OR.
>
> Julian
>
>
>
>
> > On Aug 13, 2020, at 12:48 PM, Julian Hyde <[email protected]>
> wrote:
> >
> > I have logged https://issues.apache.org/jira/browse/CALCITE-4173 <
> https://issues.apache.org/jira/browse/CALCITE-4173> and am close to a
> prototype.
> >
> > See comments inline, below.
> >
> >> On Aug 11, 2020, at 3:39 PM, Haisheng Yuan <[email protected] <mailto:
> [email protected]>> wrote:
> >>
> >>> I am proposing to use sargs only where we would today use RexCall(IN).
> The data structures have about the same size. Sargs are sorted. I cannot
> see any cases that would become more expensive.
> >>
> >> IIRC, RexCall(IN) is not supported yet.
> >
> > It’s not officially supported. But it is there - added in
> https://issues.apache.org/jira/browse/CALCITE-2444 <
> https://issues.apache.org/jira/browse/CALCITE-2444> for just Rel-to-Sql
> part of the process, and people seem to be using it elsewhere, e.g.
> https://issues.apache.org/jira/browse/CALCITE-1413 <
> https://issues.apache.org/jira/browse/CALCITE-1413> handles IN in CASE.
> >
> > I have not used RexCall.IN personally. I think there’s too much cost (in
> making all places aware of the IN operator) and not enough benefit (because
> it cannot handle ranges or <>).
> >
> > One part of the cost is null-semantics. If the values are not literals,
> one of them might evaluate to NULL. So we have to be very conservative,
> especially when optimizing “NOT IN (list)”. There are no tests for that
> simplification, so we’re probably doing it wrong.
> >
> > So, I propose to remove IN.
> >
> >> With sargs, you have to sort it, no matter explicitly or implicitly, in
> case 20k values, it will take some time.
> >
> > I am not too concerned about the cost of sorting. If you have parsed and
> validated an IN list with 20k items, the query has already burned a few CPU
> cycles.
> >
> > The Sarg data structure is immutable and one instance should suffice for
> the whole planning process, so the cost of sorting on creation is more than
> repaid by the faster access later on.
> >
> >> I am not saying the data structure itself is expensive, but how it is
> used, it all depends on how downstream projects derive stats from it. I
> have seen Greenplum optimizer experienced performance issue for large IN
> list when deriving all the stats info from the thousands of disjoint range
> intervals. Some downstream project may not even use histogram. Some may
> just use the number of IN constants to derive the number of distinct
> values, I would imagine people have to construct the Sarg back to constant
> list, this is an extra burden.
> >
> > If there are performance issues, we can add extra operations to the Sarg
> data structure without changing its size/cost. For example, record whether
> all the ranges are points. And if so, efficiently iterate over all the
> values.
> >
> >> Suppose we have col in (1,2,4,6,8,10, 12,13, 15,17,19, 21,22, 24,
> 24+2.....) 10k values, now the range sets will be {[1,2], [4,4],
> ....[10,10], [12,13], [15,15]....[21,22], [24,24]....}, in the execution
> engine, what if I want to do the look up for this expression? Find out the
> points, extract ranges and disjoint them, or infer points from integer
> ranges?
> >
> > You should transcribe the Sarg into a data structure that makes sense
> for your execution engine. Say a bloom filter, a compressed bitmap, or a
> perfect hash.
> >
> >> IMHO, the interval constraint is more like a kind of logical property
> and can be strategy for simplification, people can derive the interval
> constraints from the IN list, do simplification on it (but I doubt the
> value in practice, which usually brings more trouble than benefits), but we
> don't have to represent it with sarg in RexNode world.
> >
> > I agree that it should be a logical property. It should be part of
> RelOptPredicateList. But to apply those predicates efficiently, we need one
> predicate that searches in a range rather than 20k predicates.
> >
> >> I am not asking sargs to support geospatial operations. And geospatial
> intersect is just an example of potential customized operation, which are
> not limited to geospatials. It can be some other binary operations. We
> can't require all the customized operation to have something like
> ST_Collect.
> >
> > Sure, there are other techniques. One of which is generating rasters
> (e.g. if you are doing a spatial join of states to national parks,
> pre-generate for each national park a bitmap of the all 100x100km squares,
> and put a 1 in the square if it overlaps the park). Rasters are kind of
> similar to sargs.
> >
> > I have been working on space-filling curves as indexes for
> point-to-shape joins, as part of
> https://issues.apache.org/jira/browse/CALCITE-1861 <
> https://issues.apache.org/jira/browse/CALCITE-1861>. In
> https://github.com/julianhyde/calcite/blob/829c33a524adb727482df27d605d193a0c7a280c/core/src/test/resources/org/apache/calcite/test/RelOptRulesTest.xml#L7966
> <
> https://github.com/julianhyde/calcite/blob/829c33a524adb727482df27d605d193a0c7a280c/core/src/test/resources/org/apache/calcite/test/RelOptRulesTest.xml#L7966>
> see how we transform a call to ST_DWithin to the predicate
> >
> > LogicalFilter(condition=[AND(OR(AND(>=($4, 36496), <=($4, 36520)),
> AND(>=($4, 36456), <=($4, 36464)), AND(>=($4, 33252), <=($4, 33254)),
> AND(>=($4, 33236), <=($4, 33244)), AND(>=($4, 33164), <=($4, 33176)),
> AND(>=($4, 33112), <=($4, 33156)), AND(>=($4, 33092), <=($4, 33100)),
> AND(>=($4, 33055), <=($4, 33080)), AND(>=($4, 33050), <=($4, 33053)),
> AND(>=($4, 33033), <=($4, 33035))), ST_DWITHIN(POINT (10 20):GEOMETRY,
> ST_POINT($1, $2), 6))])
> >
> > The first part of the condition is precisely a Sarg - a set of integer
> ranges. If represented an a set of points, it would be much larger.
> >
> >> Oracle support this:
> >> select * from foo where a > ANY(1, 2, 3, ....., 999);
> >> select * from foo where a <= ANY(b+1, b+2, b+3, ....., 999);
> >> Although in reality that kind of query might be rare, how would Calcite
> represent in Oracle dialect? And would it be good to support operations
> other than equal, like <, >=, or even customized operation? Especially when
> downstream projects may use calcite to build their own in-house system,
> some are not sql standard.
> >
> > I think of ANY and ALL - especially on constant lists, as opposed to
> queries - as syntactic sugar.
> >
> > The first one can be represented as a Sarg:
> >
> >   SEARCH(a, SARG((1, inf), (2, inf), (3, inf), … (999, inf))
> >
> > And the Sarg would optimize itself, on construction, to a single range,
> SARG((1, inf)). This illustrates the power of Sargs - they are simple
> mathematical objects, they are in a canonical form, and there are only a
> few operations. It would take thought to optimize
> >
> >   a > ANY (1, 2, 3, …, 9)
> >
> > to
> >
> >   a > 1
> >
> > but Sarg does it automatically.
> >
> > The second query cannot be represented as a Sarg, because it is not a
> constant.
> >
> >> BTW, what is your concern about the proposal of RexListCmp [1]?
> >
> > RexListCmp is certainly moving in the same direction as Sarg - a
> generalized IN-list. But it is less elegant - it is not clear to me what
> are the fundamental operations on a RexListCmp.
> >
> > The ability to choose your operator adds complexity but little power.
> The “generalized IN lists” that I see in plans are better modeled by a
> mixture of operators - "x > 3 AND x < 10 OR x = 100 OR x > 1000” than
> having one operator throughout the list.
> >
> > I was not aware of Oracle's “expression relop (ANY | ALL) (list)” syntax
> and I don’t think it’s very common. It can be expanded at compile time to
> an OR, and that OR can be converted to a Sarg if list consists only of
> constant expressions. I think that is a good solution to the common case.
> >
> > Julian
> >
>
>

Reply via email to