Integer is of course the main example but when I say I like “algebraic” approach of Guava range sets I mean that it operates based on the properties of data types, not any hard-coded list.
Sargs (and Guava range sets) merely require the values to be comparable (i.e. totally ordered). So they would apply to string and float types, for instance. Or lexically ordered tuples, (‘a’, 1), (‘b’, 2). Null does not fit neatly into a total order, so we’d have to find a way to finesse expressions like ‘x between 3 and 5 or x is null’. Sargs (and Guava range sets) are able to do additional optimizations if the domain is discrete. The optimization example I gave, ">2 AND <4" to “3”, uses that discrete property. Integers are the main example of discrete domains, but “INTERVAL DAY” and “BOOLEAN” are other examples. Bounded-length strings (e.g. VARCHAR(10)) are also a discrete domain, but the Guava documentation recommends against it. Maybe there are just too many values. I do think it would be worth treating constrained types, e.g. paymentType VARCHAR(6) CHECK (paymentType IN (‘CASH’,’CREDIT’)) as discrete domains. Julian > On Aug 10, 2020, at 9:15 AM, Michael Mior <[email protected]> wrote: > > The simplifications you present would also of course assume that the > column has integer type. In any case, overall the proposal seems solid > to me. > > -- > Michael Mior > [email protected] > > Le lun. 10 août 2020 à 04:21, Fan Liya <[email protected]> a écrit : >> >> Hi Julian, >> >> Thanks for opening the discussion. >> >> In general, I am convinced of the value and importance of the proposal. >> >> I want to discuss more about the algebra involved, as the simplified range >> sets may not have the minimum computational costs. >> Some examples: >> >> 1. Suppose we have expression x in (1, 2, 4, 5, 6). >> >> The range set implementation may reduce it to (x >= 1 and x <= 2) or (x >= >> 4 and x <= 6), which represents 4 comparisons and 3 logical operations. >> Another equivalent expression is x = 1 or x = 2 or x = 4 or x = 5 or x = 6, >> which represents 5 comparisons and 4 logical operations. >> It's hard to say which one has a smaller computational cost. >> >> 2. Suppose we have expression x in (1, 2, 4, 5) >> The range set implementation may reduce it to (x >= 1 and x <= 2) or (x >= >> 4 and x <= 5). >> Another (possibly cheaper) reduction should be x >= 1 and x <= 5 and x != 3. >> I am not sure if the range set implementation supports this type of >> simplification. >> >> Therefore, to address above problems, >> 1. We may need a model to estimate the cost, and the simplification process >> should be guided by the cost model. >> 2. Maybe we need to extend the range set implementation so it produces >> expressions with minimum computational cost. >> >> Best, >> Liya Fan >> >> >> >> >> >> On Mon, Aug 10, 2020 at 3:27 AM Julian Hyde <[email protected]> wrote: >> >>> We have had several discussions over the years about how to represent >>> IN-lists (e.g. “x IN (1, 3, 5)”) in RexNode-land. >>> >>> I have generally taken the position that we should expand to ORs (e.g. “x >>> = 1 OR x = 3 OR x = 5”) but a few months ago accepted that we should allow >>> IN in RexCall. >>> >>> I have given this some further thought, as part of a couple of RexSimplify >>> bugs [1] [2] and I now think we should replace IN with something more >>> powerful, namely sargs. A sarg (or search argument [3]) is an ordered set >>> of intervals, and can be represented as a Guava ImmutableRangeSet, such as >>> "[[0‥1), (1‥2], [3‥3], (5‥+∞)]". It can represent an IN-list of constants, >>> but also ranges. >>> >>> Today you would write >>> >>> RexCall(IN, RexInputRef(0), RexLiteral(1), RexLiteral(3), RexLiteral(5)) >>> >>> With sargs, you would instead write >>> >>> RexCall(IN_SARG, RexInputRef(0), RexSarg(ImmutableRangeSet(“[[1..1], >>> [3..3], [5..5]]”))) >>> >>> There is a new operator IN_SARG, and a new node type RexSarg (it could >>> almost be a RexLiteral). >>> >>> Sargs (and Guava RangeSets) have a powerful and consistent algebra, so if >>> we invest in sarg support in RexSimplify and RelOptPredicateList, that >>> investment is likely to pay dividends in better plans. >>> >>> Guava RangeSets, and hence sargs, have support for discrete domains, so >>> they can easily optimize ">2 AND <4" to "3”. >>> >>> Sargs would be the preferred (therefore canonical) form for any AND or OR >>> list that has more than one comparison on the same operand (e.g. "x > 3 AND >>> x < 17 AND x <> 10”). >>> >>> This proposal would subsume IN and therefore we would stop supporting IN >>> in RexCall. >>> >>> Julian >>> >>> [1] https://issues.apache.org/jira/browse/CALCITE-4155 < >>> https://issues.apache.org/jira/browse/CALCITE-4155> >>> >>> [2] https://github.com/julianhyde/calcite/tree/4159-simplify < >>> https://github.com/julianhyde/calcite/tree/4159-simplify> >>> >>> [3] https://en.wikipedia.org/wiki/Sargable < >>> https://en.wikipedia.org/wiki/Sargable>
