Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

2020-09-29 Thread Julian Hyde
There is already an issue requesting RangeSet taken out of beta: 
https://github.com/google/guava/issues/3376 
 

> On Sep 29, 2020, at 1:41 PM, Stamatis Zampetakis  wrote:
> 
> If it is a big concern can't we mark our own classes/fields relying on Beta
> APIs as experimental and subject to change?
> 
> In [1] they also say the following about Beta APIs:
> 
> "All this said, @Beta features tend to remain relatively stable. If we
> decide to delete a @Beta feature, we will typically deprecate it for one
> release before deleting it.
> 
> On the other hand, if you want something taken out of @Beta, file an issue.
> We generally promote features out of @Beta only when it's specifically
> requested, so if you don't ask, it won't happen."
> 
> So in the meantime let's also request them to promote the respective APIs
> from beta.
> 
> Best,
> Stamatis
> 
> [1] https://github.com/google/guava/wiki/PhilosophyExplained
> 
> On Tue, Sep 29, 2020 at 9:31 PM Julian Hyde  wrote:
> 
>> For the record, the Druid adapter has used RangeSet for a long while, and
>> it made sense, because Druid was doing tricky computations on date ranges.
>> Introducing Sargs brought that style to other parts of Calcite.
>> 
>> If someone was to build an adapter similar to the Druid adapter, based on
>> 1.26, externally to Calcite, they probably would not have to depend on
>> RangeSet because Calcite’s Sarg class would do the rewrites that they need.
>> 
>> Julian
>> 
>> 
>>> On Sep 29, 2020, at 12:24 PM, Vladimir Sitnikov <
>> sitnikov.vladi...@gmail.com> wrote:
>>> 
>>> Julian>The vast majority of clients who use Sarg (or expressions that
>>> contain Sarg) will not reference RangeSet
>>> Julian> directly and therefore would not be impacted. So I think it’s an
>>> acceptable risk.
>>> 
>>> Well, it is hard to tell, however, I know Druid is using Sargs, and, I
>>> guess, Druid is one among the very least tested adapters.
>>> See
>>> 
>> https://github.com/apache/calcite/pull/2182/commits/edf57dce13d00d3f7c4035c323f5de2568dc8699
>>> 
>>> In other words, Druid adapter proves Calcite forces clients to use
>>> Guava's @Beta API :(
>>> 
>>> Vladimir
>> 
>> 



Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

2020-09-29 Thread Stamatis Zampetakis
If it is a big concern can't we mark our own classes/fields relying on Beta
APIs as experimental and subject to change?

In [1] they also say the following about Beta APIs:

"All this said, @Beta features tend to remain relatively stable. If we
decide to delete a @Beta feature, we will typically deprecate it for one
release before deleting it.

On the other hand, if you want something taken out of @Beta, file an issue.
We generally promote features out of @Beta only when it's specifically
requested, so if you don't ask, it won't happen."

So in the meantime let's also request them to promote the respective APIs
from beta.

Best,
Stamatis

[1] https://github.com/google/guava/wiki/PhilosophyExplained

On Tue, Sep 29, 2020 at 9:31 PM Julian Hyde  wrote:

> For the record, the Druid adapter has used RangeSet for a long while, and
> it made sense, because Druid was doing tricky computations on date ranges.
> Introducing Sargs brought that style to other parts of Calcite.
>
> If someone was to build an adapter similar to the Druid adapter, based on
> 1.26, externally to Calcite, they probably would not have to depend on
> RangeSet because Calcite’s Sarg class would do the rewrites that they need.
>
> Julian
>
>
> > On Sep 29, 2020, at 12:24 PM, Vladimir Sitnikov <
> sitnikov.vladi...@gmail.com> wrote:
> >
> > Julian>The vast majority of clients who use Sarg (or expressions that
> > contain Sarg) will not reference RangeSet
> > Julian> directly and therefore would not be impacted. So I think it’s an
> > acceptable risk.
> >
> > Well, it is hard to tell, however, I know Druid is using Sargs, and, I
> > guess, Druid is one among the very least tested adapters.
> > See
> >
> https://github.com/apache/calcite/pull/2182/commits/edf57dce13d00d3f7c4035c323f5de2568dc8699
> >
> > In other words, Druid adapter proves Calcite forces clients to use
> > Guava's @Beta API :(
> >
> > Vladimir
>
>


Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

2020-09-29 Thread Julian Hyde
For the record, the Druid adapter has used RangeSet for a long while, and it 
made sense, because Druid was doing tricky computations on date ranges. 
Introducing Sargs brought that style to other parts of Calcite.

If someone was to build an adapter similar to the Druid adapter, based on 1.26, 
externally to Calcite, they probably would not have to depend on RangeSet 
because Calcite’s Sarg class would do the rewrites that they need.

Julian


> On Sep 29, 2020, at 12:24 PM, Vladimir Sitnikov  
> wrote:
> 
> Julian>The vast majority of clients who use Sarg (or expressions that
> contain Sarg) will not reference RangeSet
> Julian> directly and therefore would not be impacted. So I think it’s an
> acceptable risk.
> 
> Well, it is hard to tell, however, I know Druid is using Sargs, and, I
> guess, Druid is one among the very least tested adapters.
> See
> https://github.com/apache/calcite/pull/2182/commits/edf57dce13d00d3f7c4035c323f5de2568dc8699
> 
> In other words, Druid adapter proves Calcite forces clients to use
> Guava's @Beta API :(
> 
> Vladimir



Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

2020-09-29 Thread Vladimir Sitnikov
Julian>The vast majority of clients who use Sarg (or expressions that
contain Sarg) will not reference RangeSet
Julian> directly and therefore would not be impacted. So I think it’s an
acceptable risk.

Well, it is hard to tell, however, I know Druid is using Sargs, and, I
guess, Druid is one among the very least tested adapters.
See
https://github.com/apache/calcite/pull/2182/commits/edf57dce13d00d3f7c4035c323f5de2568dc8699

In other words, Druid adapter proves Calcite forces clients to use
Guava's @Beta API :(

Vladimir


Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

2020-09-29 Thread Julian Hyde



> On Sep 29, 2020, at 11:40 AM, Vladimir Sitnikov  
> wrote:
> 
> Julian>Suppose - worst case scenario - that Guava 30 removes RangeSet. The
> solution would be for us to copy RangeSet and enough dependent classes into
> Calcite to serve our needs
> 
> Then all the clients would have to adjust packages because we can't copy
> Guava code and keep Guava's package names.

That is a valid concern. RangeSet used in a public field and a public method in 
class Sarg.

The vast majority of clients who use Sarg (or expressions that contain Sarg) 
will not reference RangeSet directly and therefore would not be impacted. So I 
think it’s an acceptable risk.

Julian



Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

2020-09-29 Thread Vladimir Sitnikov
Julian>Suppose - worst case scenario - that Guava 30 removes RangeSet. The
solution would be for us to copy RangeSet and enough dependent classes into
Calcite to serve our needs

Then all the clients would have to adjust packages because we can't copy
Guava code and keep Guava's package names.

Julian>I knew the risks going in, and I’m not very concerned

I see Guava adjusted RangeSet#toString(). The change like that triggers
lots of plan representation changes :-(

Vladimir


Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

2020-09-29 Thread Julian Hyde
I knew the risks going in, and I’m not very concerned. Suppose - worst case 
scenario - that Guava 30 removes RangeSet. The solution would be for us to copy 
RangeSet and enough dependent classes into Calcite to serve our needs. Our 
support for Guava 30 would be delayed by a month or two.

Sarg is a powerful abstraction for planning queries, and RangeSet is a good 
library to implement it on. If it hadn’t been in Guava, we would have had to 
implement something like it ourselves. (In fact, I recall one Calcite PR that 
basically had its own implementation of RangeSet.)

Julian


> On Sep 29, 2020, at 7:44 AM, Vladimir Sitnikov  
> wrote:
> 
> Apparently, RangeSet is a @Beta API in Guava, and they stress that @Beta
> APIs should not be used in libraries.
> 
> I suggest we do something with that, otherwise, it results in a case where
> Calcite enforces all the consumers to depend on @Beta API which
> might disappear or drift.
> 
> See https://github.com/google/guava/wiki/PhilosophyExplained#beta-apis
> 
> It is literally the very first "important warning":
> 
> https://github.com/google/guava#important-warnings
> 
> 1, APIs marked with the @Beta annotation at the class or method level are
> subject to change.
> They can be modified in any way, or even removed, at any time. If your
> code is a library itself
> (i.e., it is used on the CLASSPATH of users outside your own control),
> you should not use beta APIs unless you repackage them. If your code is a
> library,
> we strongly recommend using the Guava Beta Checker to ensure that you do
> not use any @Beta APIs!
> 
> Should we revert RangeSets?
> Should we ask Guava to promote it from @Beta?
> 
> Vladimir



Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

2020-09-29 Thread Vladimir Sitnikov
Apparently, RangeSet is a @Beta API in Guava, and they stress that @Beta
APIs should not be used in libraries.

I suggest we do something with that, otherwise, it results in a case where
Calcite enforces all the consumers to depend on @Beta API which
might disappear or drift.

See https://github.com/google/guava/wiki/PhilosophyExplained#beta-apis

It is literally the very first "important warning":

https://github.com/google/guava#important-warnings

1, APIs marked with the @Beta annotation at the class or method level are
subject to change.
 They can be modified in any way, or even removed, at any time. If your
code is a library itself
(i.e., it is used on the CLASSPATH of users outside your own control),
you should not use beta APIs unless you repackage them. If your code is a
library,
we strongly recommend using the Guava Beta Checker to ensure that you do
not use any @Beta APIs!

Should we revert RangeSets?
Should we ask Guava to promote it from @Beta?

Vladimir


Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

2020-08-30 Thread Chunwei Lei
Hi, Julian.

I would like to review it.


Best,
Chunwei


On Sat, Aug 29, 2020 at 5:19 AM Julian Hyde  wrote:

> I now have a PR for this change. Can some people review?
>
> https://github.com/apache/calcite/pull/2124 <
> https://github.com/apache/calcite/pull/2124>
>
> There are legitimate concerns that we are adding a new RexCall operator
> (SEARCH) and there is a considerable effort to support it everywhere. But
> we are removing two RexCall operators (IN-list and BETWEEN), because SEARCH
> subsumes them.
>
> The benefit is that we can do ‘range algebra’ simplifications without
> resorting to error-prone Boolean logic simplifications as often as we do
> today.
>
> I have added utilities to introduce SEARCH into SEARCH-less RexNode
> expressions, and to remove SEARCH from RexNode expressions, and to convert
> RexNode expressions that contain SEARCH directly into SqlNode expressions.
>
> In several places, I just expanded SEARCH, but it would be better to
> handle SEARCH explicitly, as if it were a comparison operator (along with
> =, <, <= etc.) It might be worth revisiting the Mongo, Geode and Druid
> adapters. Geode has an ‘IS SET’ construct, very similar to SQL ‘IN-list’,
> that would easier to recognize on SEARCH than on OR.
>
> Julian
>
>
>
>
> > On Aug 13, 2020, at 12:48 PM, Julian Hyde 
> wrote:
> >
> > I have logged https://issues.apache.org/jira/browse/CALCITE-4173 <
> https://issues.apache.org/jira/browse/CALCITE-4173> and am close to a
> prototype.
> >
> > See comments inline, below.
> >
> >> On Aug 11, 2020, at 3:39 PM, Haisheng Yuan  hy...@apache.org>> wrote:
> >>
> >>> I am proposing to use sargs only where we would today use RexCall(IN).
> The data structures have about the same size. Sargs are sorted. I cannot
> see any cases that would become more expensive.
> >>
> >> IIRC, RexCall(IN) is not supported yet.
> >
> > It’s not officially supported. But it is there - added in
> https://issues.apache.org/jira/browse/CALCITE-2444 <
> https://issues.apache.org/jira/browse/CALCITE-2444> for just Rel-to-Sql
> part of the process, and people seem to be using it elsewhere, e.g.
> https://issues.apache.org/jira/browse/CALCITE-1413 <
> https://issues.apache.org/jira/browse/CALCITE-1413> handles IN in CASE.
> >
> > I have not used RexCall.IN personally. I think there’s too much cost (in
> making all places aware of the IN operator) and not enough benefit (because
> it cannot handle ranges or <>).
> >
> > One part of the cost is null-semantics. If the values are not literals,
> one of them might evaluate to NULL. So we have to be very conservative,
> especially when optimizing “NOT IN (list)”. There are no tests for that
> simplification, so we’re probably doing it wrong.
> >
> > So, I propose to remove IN.
> >
> >> With sargs, you have to sort it, no matter explicitly or implicitly, in
> case 20k values, it will take some time.
> >
> > I am not too concerned about the cost of sorting. If you have parsed and
> validated an IN list with 20k items, the query has already burned a few CPU
> cycles.
> >
> > The Sarg data structure is immutable and one instance should suffice for
> the whole planning process, so the cost of sorting on creation is more than
> repaid by the faster access later on.
> >
> >> I am not saying the data structure itself is expensive, but how it is
> used, it all depends on how downstream projects derive stats from it. I
> have seen Greenplum optimizer experienced performance issue for large IN
> list when deriving all the stats info from the thousands of disjoint range
> intervals. Some downstream project may not even use histogram. Some may
> just use the number of IN constants to derive the number of distinct
> values, I would imagine people have to construct the Sarg back to constant
> list, this is an extra burden.
> >
> > If there are performance issues, we can add extra operations to the Sarg
> data structure without changing its size/cost. For example, record whether
> all the ranges are points. And if so, efficiently iterate over all the
> values.
> >
> >> Suppose we have col in (1,2,4,6,8,10, 12,13, 15,17,19, 21,22, 24,
> 24+2.) 10k values, now the range sets will be {[1,2], [4,4],
> [10,10], [12,13], [15,15][21,22], [24,24]}, in the execution
> engine, what if I want to do the look up for this expression? Find out the
> points, extract ranges and disjoint them, or infer points from integer
> ranges?
> >
> > You should transcribe the Sarg into a data structure that makes sense
> for your execution engine. Say a bloom filter, a compressed bitmap, or a
> perfect hash.
> >
> >> IMHO, the interval constraint is more like a kind of logical property
> and can be strategy for simplification, people can derive the interval
> constraints from the IN list, do simplification on it (but I doubt the
> value in practice, which usually brings more trouble than benefits), but we
> don't have to represent it with sarg in RexNode world.
> >
> > I agree that it should be a 

Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

2020-08-28 Thread Julian Hyde
I now have a PR for this change. Can some people review?

https://github.com/apache/calcite/pull/2124 


There are legitimate concerns that we are adding a new RexCall operator 
(SEARCH) and there is a considerable effort to support it everywhere. But we 
are removing two RexCall operators (IN-list and BETWEEN), because SEARCH 
subsumes them.

The benefit is that we can do ‘range algebra’ simplifications without resorting 
to error-prone Boolean logic simplifications as often as we do today.

I have added utilities to introduce SEARCH into SEARCH-less RexNode 
expressions, and to remove SEARCH from RexNode expressions, and to convert 
RexNode expressions that contain SEARCH directly into SqlNode expressions.

In several places, I just expanded SEARCH, but it would be better to handle 
SEARCH explicitly, as if it were a comparison operator (along with =, <, <= 
etc.) It might be worth revisiting the Mongo, Geode and Druid adapters. Geode 
has an ‘IS SET’ construct, very similar to SQL ‘IN-list’, that would easier to 
recognize on SEARCH than on OR.

Julian




> On Aug 13, 2020, at 12:48 PM, Julian Hyde  wrote:
> 
> I have logged https://issues.apache.org/jira/browse/CALCITE-4173 
>  and am close to a 
> prototype.
> 
> See comments inline, below.
> 
>> On Aug 11, 2020, at 3:39 PM, Haisheng Yuan > > wrote:
>> 
>>> I am proposing to use sargs only where we would today use RexCall(IN). The 
>>> data structures have about the same size. Sargs are sorted. I cannot see 
>>> any cases that would become more expensive.
>> 
>> IIRC, RexCall(IN) is not supported yet.
> 
> It’s not officially supported. But it is there - added in 
> https://issues.apache.org/jira/browse/CALCITE-2444 
>  for just Rel-to-Sql part 
> of the process, and people seem to be using it elsewhere, e.g. 
> https://issues.apache.org/jira/browse/CALCITE-1413 
>  handles IN in CASE.
> 
> I have not used RexCall.IN personally. I think there’s too much cost (in 
> making all places aware of the IN operator) and not enough benefit (because 
> it cannot handle ranges or <>).
> 
> One part of the cost is null-semantics. If the values are not literals, one 
> of them might evaluate to NULL. So we have to be very conservative, 
> especially when optimizing “NOT IN (list)”. There are no tests for that 
> simplification, so we’re probably doing it wrong.
> 
> So, I propose to remove IN.
> 
>> With sargs, you have to sort it, no matter explicitly or implicitly, in case 
>> 20k values, it will take some time.
> 
> I am not too concerned about the cost of sorting. If you have parsed and 
> validated an IN list with 20k items, the query has already burned a few CPU 
> cycles.
> 
> The Sarg data structure is immutable and one instance should suffice for the 
> whole planning process, so the cost of sorting on creation is more than 
> repaid by the faster access later on.
> 
>> I am not saying the data structure itself is expensive, but how it is used, 
>> it all depends on how downstream projects derive stats from it. I have seen 
>> Greenplum optimizer experienced performance issue for large IN list when 
>> deriving all the stats info from the thousands of disjoint range intervals. 
>> Some downstream project may not even use histogram. Some may just use the 
>> number of IN constants to derive the number of distinct values, I would 
>> imagine people have to construct the Sarg back to constant list, this is an 
>> extra burden. 
> 
> If there are performance issues, we can add extra operations to the Sarg data 
> structure without changing its size/cost. For example, record whether all the 
> ranges are points. And if so, efficiently iterate over all the values.
> 
>> Suppose we have col in (1,2,4,6,8,10, 12,13, 15,17,19, 21,22, 24, 24+2.) 
>> 10k values, now the range sets will be {[1,2], [4,4], [10,10], [12,13], 
>> [15,15][21,22], [24,24]}, in the execution engine, what if I want to 
>> do the look up for this expression? Find out the points, extract ranges and 
>> disjoint them, or infer points from integer ranges?
> 
> You should transcribe the Sarg into a data structure that makes sense for 
> your execution engine. Say a bloom filter, a compressed bitmap, or a perfect 
> hash.
> 
>> IMHO, the interval constraint is more like a kind of logical property and 
>> can be strategy for simplification, people can derive the interval 
>> constraints from the IN list, do simplification on it (but I doubt the value 
>> in practice, which usually brings more trouble than benefits), but we don't 
>> have to represent it with sarg in RexNode world.
> 
> I agree that it should be a logical property. It should be part of 
> RelOptPredicateList. But to apply those predicates efficiently, we need one 
> predicate that searches in 

Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

2020-08-11 Thread Haisheng Yuan
> I am proposing to use sargs only where we would today use RexCall(IN). The 
> data structures have about the same size. Sargs are sorted. I cannot see any 
> cases that would become more expensive.

IIRC, RexCall(IN) is not supported yet.

With sargs, you have to sort it, no matter explicitly or implicitly, in case 
20k values, it will take some time. I am not saying the data structure itself 
is expensive, but how it is used, it all depends on how downstream projects 
derive stats from it. I have seen Greenplum optimizer experienced performance 
issue for large IN list when deriving all the stats info from the thousands of 
disjoint range intervals. Some downstream project may not even use histogram. 
Some may just use the number of IN constants to derive the number of distinct 
values, I would imagine people have to construct the Sarg back to constant 
list, this is an extra burden. 

Suppose we have col in (1,2,4,6,8,10, 12,13, 15,17,19, 21,22, 24, 24+2.) 
10k values, now the range sets will be {[1,2], [4,4], [10,10], [12,13], 
[15,15][21,22], [24,24]}, in the execution engine, what if I want to do 
the look up for this expression? Find out the points, extract ranges and 
disjoint them, or infer points from integer ranges?

IMHO, the interval constraint is more like a kind of logical property and can 
be strategy for simplification, people can derive the interval constraints from 
the IN list, do simplification on it (but I doubt the value in practice, which 
usually brings more trouble than benefits), but we don't have to represent it 
with sarg in RexNode world.

> Requiring sargs to support geospatial operations seems an unreasonably high 
> bar. Many techniques that we use (sorting, hashing, sketches) do not support 
> geospatial.
> Areas that have not-totally-ordered data types tend to have their own 
> operators already. For your example I’d write ST_Intersects(col, 
> ST_Collect(area1, area2, area3)), and that would be a perfectly suitable 
> representation in RexNode-land.

I am not asking sargs to support geospatial operations. And geospatial 
intersect is just an example of potential customized operation, which are not 
limited to geospatials. It can be some other binary operations. We can't 
require all the customized operation to have something like ST_Collect. 

Oracle support this:
select * from foo where a > ANY(1, 2, 3, ., 999);
select * from foo where a <= ANY(b+1, b+2, b+3, ., 999);
Although in reality that kind of query might be rare, how would Calcite 
represent in Oracle dialect? And would it be good to support operations other 
than equal, like <, >=, or even customized operation? Especially when 
downstream projects may use calcite to build their own in-house system, some 
are not sql standard. 

BTW, what is your concern about the proposal of RexListCmp [1]?

Haisheng

[1] 
https://lists.apache.org/thread.html/ra36f256e3f86f78ee126455b3ba67dbcb7bd1862214886c65ffba432%40%3Cdev.calcite.apache.org%3E

On 2020/08/11 16:44:56, Julian Hyde  wrote: 
> Let’s bring this discussion back to the original question: for these kinds of 
> predicates, is Sarg a better data structure than IN?
> 
> It absolutely is. There are algorithms in RexSimplify that, for each term in 
> an OR list, simplify using all available predicates, and when they have 
> simplified that term, add it to the list of predicates. It is therefore 
> cartesian in the size of the OR list.
> 
> The sarg data structure converts many kinds of large OR-lists and predicate 
> sets into a single term. Because that term is immutable and based on sorted 
> ranges (or points) it can be handled efficiently (e.g. two sargs can be 
> intersected using a merge, rather than nested loops). That will make our 
> simplification process more efficient.
> 
> Whether we then add new classes of optimization is a discussion for another 
> day.
> 
> Julian
> 
> 
> > On Aug 10, 2020, at 1:13 PM, Vladimir Sitnikov 
> >  wrote:
> > 
> > Julian>I cannot see any cases that would become more expensive.
> > 
> > I mean the optimization passes might be complicated, not the storage of
> > sargs themselves.
> > For instance, CALCITE-4155 converts a in (1, 2, 3, 4, 5) to a >= 1 and a <=
> > 5
> > Is it a significant improvement?
> > Is between representation always better?
> > Does the case appear very often in practice?
> > 
> > However, the optimization does take time, so it looks like extra logic with
> > doubtful gains.
> > Of course, it is unlikely sargs would be a major time consumer, however, if
> > we keep adding tiny simplifications we might
> > end up with a condition where the planning is slow and we have no way to
> > fix it.
> > 
> > I'm ok with people updating RexSimplify (and 4155 looks like an innocent
> > feature), however, I think the current design is not really scalable
> > (e.g. it might process the same expression multiple times).
> > 
> > Vladimir
> 
> 


Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

2020-08-11 Thread Julian Hyde
Let’s bring this discussion back to the original question: for these kinds of 
predicates, is Sarg a better data structure than IN?

It absolutely is. There are algorithms in RexSimplify that, for each term in an 
OR list, simplify using all available predicates, and when they have simplified 
that term, add it to the list of predicates. It is therefore cartesian in the 
size of the OR list.

The sarg data structure converts many kinds of large OR-lists and predicate 
sets into a single term. Because that term is immutable and based on sorted 
ranges (or points) it can be handled efficiently (e.g. two sargs can be 
intersected using a merge, rather than nested loops). That will make our 
simplification process more efficient.

Whether we then add new classes of optimization is a discussion for another day.

Julian


> On Aug 10, 2020, at 1:13 PM, Vladimir Sitnikov  
> wrote:
> 
> Julian>I cannot see any cases that would become more expensive.
> 
> I mean the optimization passes might be complicated, not the storage of
> sargs themselves.
> For instance, CALCITE-4155 converts a in (1, 2, 3, 4, 5) to a >= 1 and a <=
> 5
> Is it a significant improvement?
> Is between representation always better?
> Does the case appear very often in practice?
> 
> However, the optimization does take time, so it looks like extra logic with
> doubtful gains.
> Of course, it is unlikely sargs would be a major time consumer, however, if
> we keep adding tiny simplifications we might
> end up with a condition where the planning is slow and we have no way to
> fix it.
> 
> I'm ok with people updating RexSimplify (and 4155 looks like an innocent
> feature), however, I think the current design is not really scalable
> (e.g. it might process the same expression multiple times).
> 
> Vladimir



Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

2020-08-10 Thread Vladimir Sitnikov
Julian>I cannot see any cases that would become more expensive.

I mean the optimization passes might be complicated, not the storage of
sargs themselves.
For instance, CALCITE-4155 converts a in (1, 2, 3, 4, 5) to a >= 1 and a <=
5
Is it a significant improvement?
Is between representation always better?
Does the case appear very often in practice?

However, the optimization does take time, so it looks like extra logic with
doubtful gains.
Of course, it is unlikely sargs would be a major time consumer, however, if
we keep adding tiny simplifications we might
end up with a condition where the planning is slow and we have no way to
fix it.

I'm ok with people updating RexSimplify (and 4155 looks like an innocent
feature), however, I think the current design is not really scalable
(e.g. it might process the same expression multiple times).

Vladimir


Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

2020-08-10 Thread Julian Hyde



> Vladimir wrote:
> 
> 5. The optimization of disjoint conditions might take significant
> optimization time with insignificant gains.

Can you give an example?

I am proposing to use sargs only where we would today use RexCall(IN). The data 
structures have about the same size. Sargs are sorted. I cannot see any cases 
that would become more expensive.

> Sargs might be helpful as databases often need to implement "integer/string
> index range scan", however, sargs do not seem to suit well for `col
> intersect any(...)` and similar cases.

Requiring sargs to support geospatial operations seems an unreasonably high 
bar. Many techniques that we use (sorting, hashing, sketches) do not support 
geospatial.

> I remember Calcite sources did include sargs package, however, it was
> removed for some reason.

That is correct. We once had b-tree indexes and column store; when we removed 
these, sargs were no longer used. Guava RangeSet is superior to the sargs 
classes we threw away.

Julian



Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

2020-08-10 Thread Vladimir Sitnikov
+1 to Haisheng.

5. The optimization of disjoint conditions might take significant
optimization time with insignificant gains.

Sargs might be helpful as databases often need to implement "integer/string
index range scan", however, sargs do not seem to suit well for `col
intersect any(...)` and similar cases.

I remember Calcite sources did include sargs package, however, it was
removed for some reason.

Vladimir


Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

2020-08-10 Thread Haisheng Yuan
I am against the proposal, for the following reasons:

1. It has to support range operations, not every type supports, especially for 
User Defined Types that are used in IN constant lists, they might only support 
equal operation.

2. The range optimization of ">2 AND <4" to "3” is only valid for integer-like 
types, but I don't think it will bring much gains. Optimizing  col in 
(1,2,3,4,5) to a >=1 and a <=5 is ok, but only ok for a single or very limited 
number of disjoint ranges, but this is a very limited corner case, most likely 
we may end up with many disjoint ranges, especially when there are 10k values. 
I don't think it is worth doing SARG for IN for this sake.

3. For non-integer data types, like double or string, we will end up with 
ranges {[a,a], [b,b], [c,c]...}, the stats derivation e.g. inferring 
selectivity from histogram may take much longer time depends on the 
implementation. And when passing to executor, we have to transform it back to 
form of col = ANY(a,b,c) or col IN (a,b,c) to be able to utilize hash look up.

4. It is not extensible. It can only be used for iN or NOT IN. What about 
customized operators like geospatial intersect? e.g. col intersect ANY(area1, 
area2, area3)

Haisheng

On 2020/08/10 16:15:56, Michael Mior  wrote: 
> The simplifications you present would also of course assume that the
> column has integer type. In any case, overall the proposal seems solid
> to me.
> 
> --
> Michael Mior
> mm...@apache.org
> 
> Le lun. 10 août 2020 à 04:21, Fan Liya  a écrit :
> >
> > Hi Julian,
> >
> > Thanks for opening the discussion.
> >
> > In general, I am convinced of the value and importance of the proposal.
> >
> > I want to discuss more about the algebra involved, as the simplified range
> > sets may not have the minimum computational costs.
> > Some examples:
> >
> > 1. Suppose we have expression x in (1, 2, 4, 5, 6).
> >
> > The range set implementation may reduce it to (x >= 1 and x <= 2) or (x >=
> > 4 and x <= 6), which represents 4 comparisons and 3 logical operations.
> > Another equivalent expression is x = 1 or x = 2 or x = 4 or x = 5 or x = 6,
> > which represents 5 comparisons and 4 logical operations.
> > It's hard to say which one has a smaller computational cost.
> >
> > 2. Suppose we have expression x in (1, 2, 4, 5)
> > The range set implementation may reduce it to (x >= 1 and x <= 2) or (x >=
> > 4 and x <= 5).
> > Another (possibly cheaper) reduction should be x >= 1 and x <= 5 and x != 3.
> > I am not sure if the range set implementation supports this type of
> > simplification.
> >
> > Therefore, to address above problems,
> > 1. We may need a model to estimate the cost, and the simplification process
> > should be guided by the cost model.
> > 2. Maybe we need to extend the range set implementation so it produces
> > expressions with minimum computational cost.
> >
> > Best,
> > Liya Fan
> >
> >
> >
> >
> >
> > On Mon, Aug 10, 2020 at 3:27 AM Julian Hyde  wrote:
> >
> > > We have had several discussions over the years about how to represent
> > > IN-lists (e.g. “x IN (1, 3, 5)”) in RexNode-land.
> > >
> > > I have generally taken the position that we should expand to ORs (e.g. “x
> > > = 1 OR x = 3 OR x = 5”) but a few months ago accepted that we should allow
> > > IN in RexCall.
> > >
> > > I have given this some further thought, as part of a couple of RexSimplify
> > > bugs [1] [2] and I now think we should replace IN with something more
> > > powerful, namely sargs. A sarg (or search argument [3]) is an ordered set
> > > of intervals, and can be represented as a Guava ImmutableRangeSet, such as
> > > "[[0‥1), (1‥2], [3‥3], (5‥+∞)]". It can represent an IN-list of constants,
> > > but also ranges.
> > >
> > > Today you would write
> > >
> > >   RexCall(IN, RexInputRef(0), RexLiteral(1), RexLiteral(3), RexLiteral(5))
> > >
> > > With sargs, you would instead write
> > >
> > >   RexCall(IN_SARG, RexInputRef(0), RexSarg(ImmutableRangeSet(“[[1..1],
> > > [3..3], [5..5]]”)))
> > >
> > > There is a new operator IN_SARG, and a new node type RexSarg (it could
> > > almost be a RexLiteral).
> > >
> > > Sargs (and Guava RangeSets) have a powerful and consistent algebra, so if
> > > we invest in sarg support in RexSimplify and RelOptPredicateList, that
> > > investment is likely to pay dividends in better plans.
> > >
> > > Guava RangeSets, and hence sargs, have support for discrete domains, so
> > > they can easily optimize ">2 AND <4" to "3”.
> > >
> > > Sargs would be the preferred (therefore canonical) form for any AND or OR
> > > list that has more than one comparison on the same operand (e.g. "x > 3 
> > > AND
> > > x < 17 AND x <> 10”).
> > >
> > > This proposal would subsume IN and therefore we would stop supporting IN
> > > in RexCall.
> > >
> > > Julian
> > >
> > > [1] https://issues.apache.org/jira/browse/CALCITE-4155 <
> > > https://issues.apache.org/jira/browse/CALCITE-4155>
> > >
> > > [2] 

Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

2020-08-10 Thread Julian Hyde
Integer is of course the main example but when I say I like “algebraic” 
approach of Guava range sets I mean that it operates based on the properties of 
data types, not any hard-coded list.

Sargs (and Guava range sets) merely require the values to be comparable (i.e. 
totally ordered). So they would apply to string and float types, for instance. 
Or lexically ordered tuples, (‘a’, 1), (‘b’, 2).

Null does not fit neatly into a total order, so we’d have to find a way to 
finesse expressions like ‘x between 3 and 5 or x is null’.

Sargs (and Guava range sets) are able to do additional optimizations if the 
domain is discrete. The optimization example I gave,  ">2 AND <4" to “3”, uses 
that discrete property. Integers are the main example of discrete domains, but 
“INTERVAL DAY” and “BOOLEAN” are other examples.

Bounded-length strings (e.g. VARCHAR(10)) are also a discrete domain, but the 
Guava documentation recommends against it. Maybe there are just too many 
values. I do think it would be worth treating constrained types, e.g.

  paymentType VARCHAR(6) CHECK (paymentType IN (‘CASH’,’CREDIT’))

as discrete domains.

Julian


> On Aug 10, 2020, at 9:15 AM, Michael Mior  wrote:
> 
> The simplifications you present would also of course assume that the
> column has integer type. In any case, overall the proposal seems solid
> to me.
> 
> --
> Michael Mior
> mm...@apache.org
> 
> Le lun. 10 août 2020 à 04:21, Fan Liya  a écrit :
>> 
>> Hi Julian,
>> 
>> Thanks for opening the discussion.
>> 
>> In general, I am convinced of the value and importance of the proposal.
>> 
>> I want to discuss more about the algebra involved, as the simplified range
>> sets may not have the minimum computational costs.
>> Some examples:
>> 
>> 1. Suppose we have expression x in (1, 2, 4, 5, 6).
>> 
>> The range set implementation may reduce it to (x >= 1 and x <= 2) or (x >=
>> 4 and x <= 6), which represents 4 comparisons and 3 logical operations.
>> Another equivalent expression is x = 1 or x = 2 or x = 4 or x = 5 or x = 6,
>> which represents 5 comparisons and 4 logical operations.
>> It's hard to say which one has a smaller computational cost.
>> 
>> 2. Suppose we have expression x in (1, 2, 4, 5)
>> The range set implementation may reduce it to (x >= 1 and x <= 2) or (x >=
>> 4 and x <= 5).
>> Another (possibly cheaper) reduction should be x >= 1 and x <= 5 and x != 3.
>> I am not sure if the range set implementation supports this type of
>> simplification.
>> 
>> Therefore, to address above problems,
>> 1. We may need a model to estimate the cost, and the simplification process
>> should be guided by the cost model.
>> 2. Maybe we need to extend the range set implementation so it produces
>> expressions with minimum computational cost.
>> 
>> Best,
>> Liya Fan
>> 
>> 
>> 
>> 
>> 
>> On Mon, Aug 10, 2020 at 3:27 AM Julian Hyde  wrote:
>> 
>>> We have had several discussions over the years about how to represent
>>> IN-lists (e.g. “x IN (1, 3, 5)”) in RexNode-land.
>>> 
>>> I have generally taken the position that we should expand to ORs (e.g. “x
>>> = 1 OR x = 3 OR x = 5”) but a few months ago accepted that we should allow
>>> IN in RexCall.
>>> 
>>> I have given this some further thought, as part of a couple of RexSimplify
>>> bugs [1] [2] and I now think we should replace IN with something more
>>> powerful, namely sargs. A sarg (or search argument [3]) is an ordered set
>>> of intervals, and can be represented as a Guava ImmutableRangeSet, such as
>>> "[[0‥1), (1‥2], [3‥3], (5‥+∞)]". It can represent an IN-list of constants,
>>> but also ranges.
>>> 
>>> Today you would write
>>> 
>>>  RexCall(IN, RexInputRef(0), RexLiteral(1), RexLiteral(3), RexLiteral(5))
>>> 
>>> With sargs, you would instead write
>>> 
>>>  RexCall(IN_SARG, RexInputRef(0), RexSarg(ImmutableRangeSet(“[[1..1],
>>> [3..3], [5..5]]”)))
>>> 
>>> There is a new operator IN_SARG, and a new node type RexSarg (it could
>>> almost be a RexLiteral).
>>> 
>>> Sargs (and Guava RangeSets) have a powerful and consistent algebra, so if
>>> we invest in sarg support in RexSimplify and RelOptPredicateList, that
>>> investment is likely to pay dividends in better plans.
>>> 
>>> Guava RangeSets, and hence sargs, have support for discrete domains, so
>>> they can easily optimize ">2 AND <4" to "3”.
>>> 
>>> Sargs would be the preferred (therefore canonical) form for any AND or OR
>>> list that has more than one comparison on the same operand (e.g. "x > 3 AND
>>> x < 17 AND x <> 10”).
>>> 
>>> This proposal would subsume IN and therefore we would stop supporting IN
>>> in RexCall.
>>> 
>>> Julian
>>> 
>>> [1] https://issues.apache.org/jira/browse/CALCITE-4155 <
>>> https://issues.apache.org/jira/browse/CALCITE-4155>
>>> 
>>> [2] https://github.com/julianhyde/calcite/tree/4159-simplify <
>>> https://github.com/julianhyde/calcite/tree/4159-simplify>
>>> 
>>> [3] https://en.wikipedia.org/wiki/Sargable <
>>> https://en.wikipedia.org/wiki/Sargable>



Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

2020-08-10 Thread Michael Mior
The simplifications you present would also of course assume that the
column has integer type. In any case, overall the proposal seems solid
to me.

--
Michael Mior
mm...@apache.org

Le lun. 10 août 2020 à 04:21, Fan Liya  a écrit :
>
> Hi Julian,
>
> Thanks for opening the discussion.
>
> In general, I am convinced of the value and importance of the proposal.
>
> I want to discuss more about the algebra involved, as the simplified range
> sets may not have the minimum computational costs.
> Some examples:
>
> 1. Suppose we have expression x in (1, 2, 4, 5, 6).
>
> The range set implementation may reduce it to (x >= 1 and x <= 2) or (x >=
> 4 and x <= 6), which represents 4 comparisons and 3 logical operations.
> Another equivalent expression is x = 1 or x = 2 or x = 4 or x = 5 or x = 6,
> which represents 5 comparisons and 4 logical operations.
> It's hard to say which one has a smaller computational cost.
>
> 2. Suppose we have expression x in (1, 2, 4, 5)
> The range set implementation may reduce it to (x >= 1 and x <= 2) or (x >=
> 4 and x <= 5).
> Another (possibly cheaper) reduction should be x >= 1 and x <= 5 and x != 3.
> I am not sure if the range set implementation supports this type of
> simplification.
>
> Therefore, to address above problems,
> 1. We may need a model to estimate the cost, and the simplification process
> should be guided by the cost model.
> 2. Maybe we need to extend the range set implementation so it produces
> expressions with minimum computational cost.
>
> Best,
> Liya Fan
>
>
>
>
>
> On Mon, Aug 10, 2020 at 3:27 AM Julian Hyde  wrote:
>
> > We have had several discussions over the years about how to represent
> > IN-lists (e.g. “x IN (1, 3, 5)”) in RexNode-land.
> >
> > I have generally taken the position that we should expand to ORs (e.g. “x
> > = 1 OR x = 3 OR x = 5”) but a few months ago accepted that we should allow
> > IN in RexCall.
> >
> > I have given this some further thought, as part of a couple of RexSimplify
> > bugs [1] [2] and I now think we should replace IN with something more
> > powerful, namely sargs. A sarg (or search argument [3]) is an ordered set
> > of intervals, and can be represented as a Guava ImmutableRangeSet, such as
> > "[[0‥1), (1‥2], [3‥3], (5‥+∞)]". It can represent an IN-list of constants,
> > but also ranges.
> >
> > Today you would write
> >
> >   RexCall(IN, RexInputRef(0), RexLiteral(1), RexLiteral(3), RexLiteral(5))
> >
> > With sargs, you would instead write
> >
> >   RexCall(IN_SARG, RexInputRef(0), RexSarg(ImmutableRangeSet(“[[1..1],
> > [3..3], [5..5]]”)))
> >
> > There is a new operator IN_SARG, and a new node type RexSarg (it could
> > almost be a RexLiteral).
> >
> > Sargs (and Guava RangeSets) have a powerful and consistent algebra, so if
> > we invest in sarg support in RexSimplify and RelOptPredicateList, that
> > investment is likely to pay dividends in better plans.
> >
> > Guava RangeSets, and hence sargs, have support for discrete domains, so
> > they can easily optimize ">2 AND <4" to "3”.
> >
> > Sargs would be the preferred (therefore canonical) form for any AND or OR
> > list that has more than one comparison on the same operand (e.g. "x > 3 AND
> > x < 17 AND x <> 10”).
> >
> > This proposal would subsume IN and therefore we would stop supporting IN
> > in RexCall.
> >
> > Julian
> >
> > [1] https://issues.apache.org/jira/browse/CALCITE-4155 <
> > https://issues.apache.org/jira/browse/CALCITE-4155>
> >
> > [2] https://github.com/julianhyde/calcite/tree/4159-simplify <
> > https://github.com/julianhyde/calcite/tree/4159-simplify>
> >
> > [3] https://en.wikipedia.org/wiki/Sargable <
> > https://en.wikipedia.org/wiki/Sargable>


Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

2020-08-10 Thread Fan Liya
Hi Julian,

Thanks for opening the discussion.

In general, I am convinced of the value and importance of the proposal.

I want to discuss more about the algebra involved, as the simplified range
sets may not have the minimum computational costs.
Some examples:

1. Suppose we have expression x in (1, 2, 4, 5, 6).

The range set implementation may reduce it to (x >= 1 and x <= 2) or (x >=
4 and x <= 6), which represents 4 comparisons and 3 logical operations.
Another equivalent expression is x = 1 or x = 2 or x = 4 or x = 5 or x = 6,
which represents 5 comparisons and 4 logical operations.
It's hard to say which one has a smaller computational cost.

2. Suppose we have expression x in (1, 2, 4, 5)
The range set implementation may reduce it to (x >= 1 and x <= 2) or (x >=
4 and x <= 5).
Another (possibly cheaper) reduction should be x >= 1 and x <= 5 and x != 3.
I am not sure if the range set implementation supports this type of
simplification.

Therefore, to address above problems,
1. We may need a model to estimate the cost, and the simplification process
should be guided by the cost model.
2. Maybe we need to extend the range set implementation so it produces
expressions with minimum computational cost.

Best,
Liya Fan





On Mon, Aug 10, 2020 at 3:27 AM Julian Hyde  wrote:

> We have had several discussions over the years about how to represent
> IN-lists (e.g. “x IN (1, 3, 5)”) in RexNode-land.
>
> I have generally taken the position that we should expand to ORs (e.g. “x
> = 1 OR x = 3 OR x = 5”) but a few months ago accepted that we should allow
> IN in RexCall.
>
> I have given this some further thought, as part of a couple of RexSimplify
> bugs [1] [2] and I now think we should replace IN with something more
> powerful, namely sargs. A sarg (or search argument [3]) is an ordered set
> of intervals, and can be represented as a Guava ImmutableRangeSet, such as
> "[[0‥1), (1‥2], [3‥3], (5‥+∞)]". It can represent an IN-list of constants,
> but also ranges.
>
> Today you would write
>
>   RexCall(IN, RexInputRef(0), RexLiteral(1), RexLiteral(3), RexLiteral(5))
>
> With sargs, you would instead write
>
>   RexCall(IN_SARG, RexInputRef(0), RexSarg(ImmutableRangeSet(“[[1..1],
> [3..3], [5..5]]”)))
>
> There is a new operator IN_SARG, and a new node type RexSarg (it could
> almost be a RexLiteral).
>
> Sargs (and Guava RangeSets) have a powerful and consistent algebra, so if
> we invest in sarg support in RexSimplify and RelOptPredicateList, that
> investment is likely to pay dividends in better plans.
>
> Guava RangeSets, and hence sargs, have support for discrete domains, so
> they can easily optimize ">2 AND <4" to "3”.
>
> Sargs would be the preferred (therefore canonical) form for any AND or OR
> list that has more than one comparison on the same operand (e.g. "x > 3 AND
> x < 17 AND x <> 10”).
>
> This proposal would subsume IN and therefore we would stop supporting IN
> in RexCall.
>
> Julian
>
> [1] https://issues.apache.org/jira/browse/CALCITE-4155 <
> https://issues.apache.org/jira/browse/CALCITE-4155>
>
> [2] https://github.com/julianhyde/calcite/tree/4159-simplify <
> https://github.com/julianhyde/calcite/tree/4159-simplify>
>
> [3] https://en.wikipedia.org/wiki/Sargable <
> https://en.wikipedia.org/wiki/Sargable>


[DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

2020-08-09 Thread Julian Hyde
We have had several discussions over the years about how to represent IN-lists 
(e.g. “x IN (1, 3, 5)”) in RexNode-land.

I have generally taken the position that we should expand to ORs (e.g. “x = 1 
OR x = 3 OR x = 5”) but a few months ago accepted that we should allow IN in 
RexCall.

I have given this some further thought, as part of a couple of RexSimplify bugs 
[1] [2] and I now think we should replace IN with something more powerful, 
namely sargs. A sarg (or search argument [3]) is an ordered set of intervals, 
and can be represented as a Guava ImmutableRangeSet, such as "[[0‥1), (1‥2], 
[3‥3], (5‥+∞)]". It can represent an IN-list of constants, but also ranges.

Today you would write

  RexCall(IN, RexInputRef(0), RexLiteral(1), RexLiteral(3), RexLiteral(5))

With sargs, you would instead write

  RexCall(IN_SARG, RexInputRef(0), RexSarg(ImmutableRangeSet(“[[1..1], [3..3], 
[5..5]]”)))

There is a new operator IN_SARG, and a new node type RexSarg (it could almost 
be a RexLiteral).

Sargs (and Guava RangeSets) have a powerful and consistent algebra, so if we 
invest in sarg support in RexSimplify and RelOptPredicateList, that investment 
is likely to pay dividends in better plans.

Guava RangeSets, and hence sargs, have support for discrete domains, so they 
can easily optimize ">2 AND <4" to "3”.

Sargs would be the preferred (therefore canonical) form for any AND or OR list 
that has more than one comparison on the same operand (e.g. "x > 3 AND x < 17 
AND x <> 10”).
 
This proposal would subsume IN and therefore we would stop supporting IN in 
RexCall.

Julian

[1] https://issues.apache.org/jira/browse/CALCITE-4155 


[2] https://github.com/julianhyde/calcite/tree/4159-simplify 
 

[3] https://en.wikipedia.org/wiki/Sargable