Hi, Julian. I would like to review it.
Best, Chunwei On Sat, Aug 29, 2020 at 5:19 AM Julian Hyde <[email protected]> wrote: > I now have a PR for this change. Can some people review? > > https://github.com/apache/calcite/pull/2124 < > https://github.com/apache/calcite/pull/2124> > > There are legitimate concerns that we are adding a new RexCall operator > (SEARCH) and there is a considerable effort to support it everywhere. But > we are removing two RexCall operators (IN-list and BETWEEN), because SEARCH > subsumes them. > > The benefit is that we can do ‘range algebra’ simplifications without > resorting to error-prone Boolean logic simplifications as often as we do > today. > > I have added utilities to introduce SEARCH into SEARCH-less RexNode > expressions, and to remove SEARCH from RexNode expressions, and to convert > RexNode expressions that contain SEARCH directly into SqlNode expressions. > > In several places, I just expanded SEARCH, but it would be better to > handle SEARCH explicitly, as if it were a comparison operator (along with > =, <, <= etc.) It might be worth revisiting the Mongo, Geode and Druid > adapters. Geode has an ‘IS SET’ construct, very similar to SQL ‘IN-list’, > that would easier to recognize on SEARCH than on OR. > > Julian > > > > > > On Aug 13, 2020, at 12:48 PM, Julian Hyde <[email protected]> > wrote: > > > > I have logged https://issues.apache.org/jira/browse/CALCITE-4173 < > https://issues.apache.org/jira/browse/CALCITE-4173> and am close to a > prototype. > > > > See comments inline, below. > > > >> On Aug 11, 2020, at 3:39 PM, Haisheng Yuan <[email protected] <mailto: > [email protected]>> wrote: > >> > >>> I am proposing to use sargs only where we would today use RexCall(IN). > The data structures have about the same size. Sargs are sorted. I cannot > see any cases that would become more expensive. > >> > >> IIRC, RexCall(IN) is not supported yet. > > > > It’s not officially supported. But it is there - added in > https://issues.apache.org/jira/browse/CALCITE-2444 < > https://issues.apache.org/jira/browse/CALCITE-2444> for just Rel-to-Sql > part of the process, and people seem to be using it elsewhere, e.g. > https://issues.apache.org/jira/browse/CALCITE-1413 < > https://issues.apache.org/jira/browse/CALCITE-1413> handles IN in CASE. > > > > I have not used RexCall.IN personally. I think there’s too much cost (in > making all places aware of the IN operator) and not enough benefit (because > it cannot handle ranges or <>). > > > > One part of the cost is null-semantics. If the values are not literals, > one of them might evaluate to NULL. So we have to be very conservative, > especially when optimizing “NOT IN (list)”. There are no tests for that > simplification, so we’re probably doing it wrong. > > > > So, I propose to remove IN. > > > >> With sargs, you have to sort it, no matter explicitly or implicitly, in > case 20k values, it will take some time. > > > > I am not too concerned about the cost of sorting. If you have parsed and > validated an IN list with 20k items, the query has already burned a few CPU > cycles. > > > > The Sarg data structure is immutable and one instance should suffice for > the whole planning process, so the cost of sorting on creation is more than > repaid by the faster access later on. > > > >> I am not saying the data structure itself is expensive, but how it is > used, it all depends on how downstream projects derive stats from it. I > have seen Greenplum optimizer experienced performance issue for large IN > list when deriving all the stats info from the thousands of disjoint range > intervals. Some downstream project may not even use histogram. Some may > just use the number of IN constants to derive the number of distinct > values, I would imagine people have to construct the Sarg back to constant > list, this is an extra burden. > > > > If there are performance issues, we can add extra operations to the Sarg > data structure without changing its size/cost. For example, record whether > all the ranges are points. And if so, efficiently iterate over all the > values. > > > >> Suppose we have col in (1,2,4,6,8,10, 12,13, 15,17,19, 21,22, 24, > 24+2.....) 10k values, now the range sets will be {[1,2], [4,4], > ....[10,10], [12,13], [15,15]....[21,22], [24,24]....}, in the execution > engine, what if I want to do the look up for this expression? Find out the > points, extract ranges and disjoint them, or infer points from integer > ranges? > > > > You should transcribe the Sarg into a data structure that makes sense > for your execution engine. Say a bloom filter, a compressed bitmap, or a > perfect hash. > > > >> IMHO, the interval constraint is more like a kind of logical property > and can be strategy for simplification, people can derive the interval > constraints from the IN list, do simplification on it (but I doubt the > value in practice, which usually brings more trouble than benefits), but we > don't have to represent it with sarg in RexNode world. > > > > I agree that it should be a logical property. It should be part of > RelOptPredicateList. But to apply those predicates efficiently, we need one > predicate that searches in a range rather than 20k predicates. > > > >> I am not asking sargs to support geospatial operations. And geospatial > intersect is just an example of potential customized operation, which are > not limited to geospatials. It can be some other binary operations. We > can't require all the customized operation to have something like > ST_Collect. > > > > Sure, there are other techniques. One of which is generating rasters > (e.g. if you are doing a spatial join of states to national parks, > pre-generate for each national park a bitmap of the all 100x100km squares, > and put a 1 in the square if it overlaps the park). Rasters are kind of > similar to sargs. > > > > I have been working on space-filling curves as indexes for > point-to-shape joins, as part of > https://issues.apache.org/jira/browse/CALCITE-1861 < > https://issues.apache.org/jira/browse/CALCITE-1861>. In > https://github.com/julianhyde/calcite/blob/829c33a524adb727482df27d605d193a0c7a280c/core/src/test/resources/org/apache/calcite/test/RelOptRulesTest.xml#L7966 > < > https://github.com/julianhyde/calcite/blob/829c33a524adb727482df27d605d193a0c7a280c/core/src/test/resources/org/apache/calcite/test/RelOptRulesTest.xml#L7966> > see how we transform a call to ST_DWithin to the predicate > > > > LogicalFilter(condition=[AND(OR(AND(>=($4, 36496), <=($4, 36520)), > AND(>=($4, 36456), <=($4, 36464)), AND(>=($4, 33252), <=($4, 33254)), > AND(>=($4, 33236), <=($4, 33244)), AND(>=($4, 33164), <=($4, 33176)), > AND(>=($4, 33112), <=($4, 33156)), AND(>=($4, 33092), <=($4, 33100)), > AND(>=($4, 33055), <=($4, 33080)), AND(>=($4, 33050), <=($4, 33053)), > AND(>=($4, 33033), <=($4, 33035))), ST_DWITHIN(POINT (10 20):GEOMETRY, > ST_POINT($1, $2), 6))]) > > > > The first part of the condition is precisely a Sarg - a set of integer > ranges. If represented an a set of points, it would be much larger. > > > >> Oracle support this: > >> select * from foo where a > ANY(1, 2, 3, ....., 999); > >> select * from foo where a <= ANY(b+1, b+2, b+3, ....., 999); > >> Although in reality that kind of query might be rare, how would Calcite > represent in Oracle dialect? And would it be good to support operations > other than equal, like <, >=, or even customized operation? Especially when > downstream projects may use calcite to build their own in-house system, > some are not sql standard. > > > > I think of ANY and ALL - especially on constant lists, as opposed to > queries - as syntactic sugar. > > > > The first one can be represented as a Sarg: > > > > SEARCH(a, SARG((1, inf), (2, inf), (3, inf), … (999, inf)) > > > > And the Sarg would optimize itself, on construction, to a single range, > SARG((1, inf)). This illustrates the power of Sargs - they are simple > mathematical objects, they are in a canonical form, and there are only a > few operations. It would take thought to optimize > > > > a > ANY (1, 2, 3, …, 9) > > > > to > > > > a > 1 > > > > but Sarg does it automatically. > > > > The second query cannot be represented as a Sarg, because it is not a > constant. > > > >> BTW, what is your concern about the proposal of RexListCmp [1]? > > > > RexListCmp is certainly moving in the same direction as Sarg - a > generalized IN-list. But it is less elegant - it is not clear to me what > are the fundamental operations on a RexListCmp. > > > > The ability to choose your operator adds complexity but little power. > The “generalized IN lists” that I see in plans are better modeled by a > mixture of operators - "x > 3 AND x < 10 OR x = 100 OR x > 1000” than > having one operator throughout the list. > > > > I was not aware of Oracle's “expression relop (ANY | ALL) (list)” syntax > and I don’t think it’s very common. It can be expanded at compile time to > an OR, and that OR can be converted to a Sarg if list consists only of > constant expressions. I think that is a good solution to the common case. > > > > Julian > > > >
