I am against the proposal, for the following reasons: 1. It has to support range operations, not every type supports, especially for User Defined Types that are used in IN constant lists, they might only support equal operation.
2. The range optimization of ">2 AND <4" to "3” is only valid for integer-like types, but I don't think it will bring much gains. Optimizing col in (1,2,3,4,5) to a >=1 and a <=5 is ok, but only ok for a single or very limited number of disjoint ranges, but this is a very limited corner case, most likely we may end up with many disjoint ranges, especially when there are 10k values. I don't think it is worth doing SARG for IN for this sake. 3. For non-integer data types, like double or string, we will end up with ranges {[a,a], [b,b], [c,c]...}, the stats derivation e.g. inferring selectivity from histogram may take much longer time depends on the implementation. And when passing to executor, we have to transform it back to form of col = ANY(a,b,c) or col IN (a,b,c) to be able to utilize hash look up. 4. It is not extensible. It can only be used for iN or NOT IN. What about customized operators like geospatial intersect? e.g. col intersect ANY(area1, area2, area3) Haisheng On 2020/08/10 16:15:56, Michael Mior <mm...@apache.org> wrote: > The simplifications you present would also of course assume that the > column has integer type. In any case, overall the proposal seems solid > to me. > > -- > Michael Mior > mm...@apache.org > > Le lun. 10 août 2020 à 04:21, Fan Liya <liya.fa...@gmail.com> a écrit : > > > > Hi Julian, > > > > Thanks for opening the discussion. > > > > In general, I am convinced of the value and importance of the proposal. > > > > I want to discuss more about the algebra involved, as the simplified range > > sets may not have the minimum computational costs. > > Some examples: > > > > 1. Suppose we have expression x in (1, 2, 4, 5, 6). > > > > The range set implementation may reduce it to (x >= 1 and x <= 2) or (x >= > > 4 and x <= 6), which represents 4 comparisons and 3 logical operations. > > Another equivalent expression is x = 1 or x = 2 or x = 4 or x = 5 or x = 6, > > which represents 5 comparisons and 4 logical operations. > > It's hard to say which one has a smaller computational cost. > > > > 2. Suppose we have expression x in (1, 2, 4, 5) > > The range set implementation may reduce it to (x >= 1 and x <= 2) or (x >= > > 4 and x <= 5). > > Another (possibly cheaper) reduction should be x >= 1 and x <= 5 and x != 3. > > I am not sure if the range set implementation supports this type of > > simplification. > > > > Therefore, to address above problems, > > 1. We may need a model to estimate the cost, and the simplification process > > should be guided by the cost model. > > 2. Maybe we need to extend the range set implementation so it produces > > expressions with minimum computational cost. > > > > Best, > > Liya Fan > > > > > > > > > > > > On Mon, Aug 10, 2020 at 3:27 AM Julian Hyde <jhyde.apa...@gmail.com> wrote: > > > > > We have had several discussions over the years about how to represent > > > IN-lists (e.g. “x IN (1, 3, 5)”) in RexNode-land. > > > > > > I have generally taken the position that we should expand to ORs (e.g. “x > > > = 1 OR x = 3 OR x = 5”) but a few months ago accepted that we should allow > > > IN in RexCall. > > > > > > I have given this some further thought, as part of a couple of RexSimplify > > > bugs [1] [2] and I now think we should replace IN with something more > > > powerful, namely sargs. A sarg (or search argument [3]) is an ordered set > > > of intervals, and can be represented as a Guava ImmutableRangeSet, such as > > > "[[0‥1), (1‥2], [3‥3], (5‥+∞)]". It can represent an IN-list of constants, > > > but also ranges. > > > > > > Today you would write > > > > > > RexCall(IN, RexInputRef(0), RexLiteral(1), RexLiteral(3), RexLiteral(5)) > > > > > > With sargs, you would instead write > > > > > > RexCall(IN_SARG, RexInputRef(0), RexSarg(ImmutableRangeSet(“[[1..1], > > > [3..3], [5..5]]”))) > > > > > > There is a new operator IN_SARG, and a new node type RexSarg (it could > > > almost be a RexLiteral). > > > > > > Sargs (and Guava RangeSets) have a powerful and consistent algebra, so if > > > we invest in sarg support in RexSimplify and RelOptPredicateList, that > > > investment is likely to pay dividends in better plans. > > > > > > Guava RangeSets, and hence sargs, have support for discrete domains, so > > > they can easily optimize ">2 AND <4" to "3”. > > > > > > Sargs would be the preferred (therefore canonical) form for any AND or OR > > > list that has more than one comparison on the same operand (e.g. "x > 3 > > > AND > > > x < 17 AND x <> 10”). > > > > > > This proposal would subsume IN and therefore we would stop supporting IN > > > in RexCall. > > > > > > Julian > > > > > > [1] https://issues.apache.org/jira/browse/CALCITE-4155 < > > > https://issues.apache.org/jira/browse/CALCITE-4155> > > > > > > [2] https://github.com/julianhyde/calcite/tree/4159-simplify < > > > https://github.com/julianhyde/calcite/tree/4159-simplify> > > > > > > [3] https://en.wikipedia.org/wiki/Sargable < > > > https://en.wikipedia.org/wiki/Sargable> >