I am against the proposal, for the following reasons:

1. It has to support range operations, not every type supports, especially for 
User Defined Types that are used in IN constant lists, they might only support 
equal operation.

2. The range optimization of ">2 AND <4" to "3” is only valid for integer-like 
types, but I don't think it will bring much gains. Optimizing  col in 
(1,2,3,4,5) to a >=1 and a <=5 is ok, but only ok for a single or very limited 
number of disjoint ranges, but this is a very limited corner case, most likely 
we may end up with many disjoint ranges, especially when there are 10k values. 
I don't think it is worth doing SARG for IN for this sake.

3. For non-integer data types, like double or string, we will end up with 
ranges {[a,a], [b,b], [c,c]...}, the stats derivation e.g. inferring 
selectivity from histogram may take much longer time depends on the 
implementation. And when passing to executor, we have to transform it back to 
form of col = ANY(a,b,c) or col IN (a,b,c) to be able to utilize hash look up.

4. It is not extensible. It can only be used for iN or NOT IN. What about 
customized operators like geospatial intersect? e.g. col intersect ANY(area1, 
area2, area3)

Haisheng

On 2020/08/10 16:15:56, Michael Mior <mm...@apache.org> wrote: 
> The simplifications you present would also of course assume that the
> column has integer type. In any case, overall the proposal seems solid
> to me.
> 
> --
> Michael Mior
> mm...@apache.org
> 
> Le lun. 10 août 2020 à 04:21, Fan Liya <liya.fa...@gmail.com> a écrit :
> >
> > Hi Julian,
> >
> > Thanks for opening the discussion.
> >
> > In general, I am convinced of the value and importance of the proposal.
> >
> > I want to discuss more about the algebra involved, as the simplified range
> > sets may not have the minimum computational costs.
> > Some examples:
> >
> > 1. Suppose we have expression x in (1, 2, 4, 5, 6).
> >
> > The range set implementation may reduce it to (x >= 1 and x <= 2) or (x >=
> > 4 and x <= 6), which represents 4 comparisons and 3 logical operations.
> > Another equivalent expression is x = 1 or x = 2 or x = 4 or x = 5 or x = 6,
> > which represents 5 comparisons and 4 logical operations.
> > It's hard to say which one has a smaller computational cost.
> >
> > 2. Suppose we have expression x in (1, 2, 4, 5)
> > The range set implementation may reduce it to (x >= 1 and x <= 2) or (x >=
> > 4 and x <= 5).
> > Another (possibly cheaper) reduction should be x >= 1 and x <= 5 and x != 3.
> > I am not sure if the range set implementation supports this type of
> > simplification.
> >
> > Therefore, to address above problems,
> > 1. We may need a model to estimate the cost, and the simplification process
> > should be guided by the cost model.
> > 2. Maybe we need to extend the range set implementation so it produces
> > expressions with minimum computational cost.
> >
> > Best,
> > Liya Fan
> >
> >
> >
> >
> >
> > On Mon, Aug 10, 2020 at 3:27 AM Julian Hyde <jhyde.apa...@gmail.com> wrote:
> >
> > > We have had several discussions over the years about how to represent
> > > IN-lists (e.g. “x IN (1, 3, 5)”) in RexNode-land.
> > >
> > > I have generally taken the position that we should expand to ORs (e.g. “x
> > > = 1 OR x = 3 OR x = 5”) but a few months ago accepted that we should allow
> > > IN in RexCall.
> > >
> > > I have given this some further thought, as part of a couple of RexSimplify
> > > bugs [1] [2] and I now think we should replace IN with something more
> > > powerful, namely sargs. A sarg (or search argument [3]) is an ordered set
> > > of intervals, and can be represented as a Guava ImmutableRangeSet, such as
> > > "[[0‥1), (1‥2], [3‥3], (5‥+∞)]". It can represent an IN-list of constants,
> > > but also ranges.
> > >
> > > Today you would write
> > >
> > >   RexCall(IN, RexInputRef(0), RexLiteral(1), RexLiteral(3), RexLiteral(5))
> > >
> > > With sargs, you would instead write
> > >
> > >   RexCall(IN_SARG, RexInputRef(0), RexSarg(ImmutableRangeSet(“[[1..1],
> > > [3..3], [5..5]]”)))
> > >
> > > There is a new operator IN_SARG, and a new node type RexSarg (it could
> > > almost be a RexLiteral).
> > >
> > > Sargs (and Guava RangeSets) have a powerful and consistent algebra, so if
> > > we invest in sarg support in RexSimplify and RelOptPredicateList, that
> > > investment is likely to pay dividends in better plans.
> > >
> > > Guava RangeSets, and hence sargs, have support for discrete domains, so
> > > they can easily optimize ">2 AND <4" to "3”.
> > >
> > > Sargs would be the preferred (therefore canonical) form for any AND or OR
> > > list that has more than one comparison on the same operand (e.g. "x > 3 
> > > AND
> > > x < 17 AND x <> 10”).
> > >
> > > This proposal would subsume IN and therefore we would stop supporting IN
> > > in RexCall.
> > >
> > > Julian
> > >
> > > [1] https://issues.apache.org/jira/browse/CALCITE-4155 <
> > > https://issues.apache.org/jira/browse/CALCITE-4155>
> > >
> > > [2] https://github.com/julianhyde/calcite/tree/4159-simplify <
> > > https://github.com/julianhyde/calcite/tree/4159-simplify>
> > >
> > > [3] https://en.wikipedia.org/wiki/Sargable <
> > > https://en.wikipedia.org/wiki/Sargable>
> 

Reply via email to