Re: SEARCH and Sargs

Stamatis Zampetakis Sat, 18 Oct 2025 01:18:11 -0700

The discussion became more about having IN in a RexCall so I logged
CALCITE-7232 [1] sharing also some background context about challenges
in Apache Hive. We can continue the discussion there and see if we can
reach consensus.


[1] https://issues.apache.org/jira/browse/CALCITE-7232

On Wed, Oct 15, 2025 at 5:55 PM Stephen Carlin
<[email protected]> wrote:
>
> This will perhaps be a reiteration of what I said in my previous email, but
> here are the major highlights, I think
>
> - SEARCH is something that makes rules more efficient. A very nice
> structure indeed and very handy.
> - I'm guessing the RexUtil.expandSearch() method was probably created
> because many databases cannot deal with SEARCH directly.  So it gets
> expanded to existing RexCall operators to standard SQL operations that all
> databases understand.  I hope I am getting this right.  A very convenient
> and nice method for the databases that don't implement SEARCH, I imagine.
> - The expandSearch method can really only have one type of expansion
> (config parameters notwithstanding).  So while many SQL functions can map
> into 1 SEARCH operator, only one of those creations can be chosen when
> expanding.
> - IN is the only cases I am aware of (are there others?) where, at
> SqlToRelConverter time, the O(1) function concept is just thrown away.  It
> either creates a subquery RelNode tree, losing the concept of a pure
> function or it changes it into an inefficient O(m) function with multiple
> ORs.  *This is actually my main issue*, not the SEARCH operator.
> - I would imagine pretty much all databases understand a concept of an IN
> function as it is part of the SQL standard, I think?  Yes, IN is
> overloaded, but this specialized use of it is very common, common enough
> whereI would imagine the non-SEARCH databases prolly implement this as a
> function with an O(1) search time.
>
> All these points together point to the missing functionality for me.  While
> CALCITE-7226 slightly addresses this issue, it really doesn't solve the
> main problem for me in that the expandSearch generates inefficient SQL for
> my database.  The RexUtil.expandSearch method is essentially unusable and I
> have to resort to writing my own expandSearch.  Or I have to write my own
> "OR -> IN' detection and convert it into my custom IN operator.
>
> On Wed, Oct 15, 2025 at 6:11 AM Stephen Carlin <[email protected]> wrote:
>
> > What you said makes perfect sense to me about restricting the functions.
> > Definitely think it was a good idea to create it and implement it.
> >
> > But I do think you are slightly off about where the tension lies.  I think
> > the tension is on a very specific issue for the most part, at least for me.
> >
> > If I'm understanding what you are saying about the tension correctly: I
> > think your NVL example is there to show that you can't really guess which
> > way to do the expand search?  And this is true.  However, the difference
> > between NVL and a case statement is prolly not something I care much
> > about.  There are only 2 parameters in an NVL function.  If you give me a
> > CASE when I originally had an NVL  my first reaction would be, "Where the
> > heck did that come from?" followed by  "meh, they do the same thing and the
> > efficiency isn't much different, the end user can figure out that they're
> > the same thing, I'll get around to fixing that later, if ever."
> >
> > But the IN case is far, far different.  It is expanded as  "A = 1 OR A = 2
> > OR A =3 OR ...." and 100 more ORs which is a huge headache.  100 OR
> > statements is quite the nightmare compared to 1 IN statement.  I really
> > need that expansion to an IN operator.
> >
> > Not sure how many operators fit the IN model as opposed to the NVL model,
> > but I'd imagine it's not a lot.  I'm guessing the main complaints came
> > about IN.
> >
> > On Tue, Oct 14, 2025 at 4:17 PM Julian Hyde <[email protected]>
> > wrote:
> >
> >> SEARCH is a semi-join too. It’s just a semi-join to an infinite set (the
> >> points on the number line, or any totally-ordered type). :)
> >>
> >> I agree that it’s useful to have multiple functions, and mappings between
> >> them. We certainly allow multiple functions in the SQL we consume, and the
> >> SQL we produce (via the JDBC adapter). And we map between them by
> >> desugaring (e.g. "a BETWEEN b AND c" ==> "a >= b AND a <= c”.)
> >>
> >> It is very advantageous to restrict the set of functions that can be used
> >> in planner rules and simplifications. Since SEARCH can represent BETWEEN
> >> (on a constant range) and >, the argument goes, let’s convert them both to
> >> SEARCH, and our rules can be simpler.
> >>
> >> I am a fan of this approach. It is much harder to produce smart,
> >> consistent, robust planning rules if we let the surface area get too large.
> >>
> >> This approach is at tension with people who want and expect their
> >> operators to pass through planning unchanged. If someone wrote "NVL(a, b)"
> >> they are surprised if it comes out as "CASE WHEN a IS NULL THEN b ELSE a”.
> >> I guess we can mitigate by re-sugaring on the backend, before we generate
> >> SQL.
> >>
> >> Julian
> >>
> >>
> >>
> >>
> >> > On Oct 14, 2025, at 1:51 PM, Stephen Carlin
> >> <[email protected]> wrote:
> >> >
> >> > Heh, it's funny you say that because Stamatis filed this improvement
> >> > request in a meeting where I was complaining about things being more
> >> > complicated and having broken code because of SEARCH :)
> >> >
> >> > My main complaint is this, and perhaps it makes sense to file another
> >> Jira
> >> > for a potential new improvement:
> >> >
> >> > I totally get SEARCH and its usefulness.  So I have no complaints about
> >> the
> >> > existence of SEARCH, why it is needed, and why it makes sense.
> >> >
> >> > My problem with SEARCH is simply that it makes my code more complex.  I
> >> > kinda think of it this way:  The SEARCH operator is a superset of the
> >> > BETWEEN operator.  And we either are ok with having a representation for
> >> > both or we are stuck with it due to legacy reasons.  I prefer the former
> >> > explanation, heh.  So why not have an IN_FUNC operator as well?  No
> >> need to
> >> > get rid of SEARCH!  Just keep it as a superset.  And then create a
> >> RexUtil
> >> > method that does the work for me that converts a relevant SEARCH to an
> >> > IN_FUNC.
> >> >
> >> > Perhaps the general issue with IN is this: Like SEARCH, IN also has a
> >> > broader definition than just a function, but the broader definition is
> >> > different from SEARCH.  In our internal discussion, we talked about how
> >> > "IN" is always a semi-join of some nature, sometimes of course with a
> >> > subquery.  I believe this is how Calcite treats IN.  But in practice, IN
> >> > has a specialized case as something that fits right into what a
> >> > RexCall/function does.  I think many databases (at least the two I work
> >> on)
> >> > implement this specialized case of IN as a function internally, just
> >> like
> >> > BETWEEN.  So to have a 1:1 RexCall operator to function for IN just
> >> seems
> >> > to make intuitive sense for me.
> >> >
> >> > Perhaps we don't have a problem with the existence of a BETWEEN operator
> >> > stems from the fact that BETWEEN doesn't have the same type of overload
> >> > that IN does.
> >> >
> >> > So yeah though: If we don't think that an IN type operator is good for
> >> > Calcite, I need to develop code local to my database that does a
> >> conversion
> >> > to a function which differs from my normal RexCall conversion.   And my
> >> > intuition tells me that this effort has already been done elsewhere.
> >> Which
> >> > is why I think it belongs in Calcite, where the community would benefit.
> >> >
> >> > Anyway, having said all this:  the improvement in CALCITE-7226 will
> >> > definitely help me out because creating all those ORs before generating
> >> a
> >> > SEARCH seems quite inefficient. It might even lead to problems in my
> >> code
> >> > as the multiple ORs generate many layers of RexNodes which I think
> >> created
> >> > some really slow compilations for me during one of my tests.
> >> >
> >> > On Tue, Oct 14, 2025 at 11:30 AM Julian Hyde <[email protected]>
> >> wrote:
> >> >
> >> >> It’s nice to see issues like
> >> >> https://issues.apache.org/jira/browse/CALCITE-7226 "Convert IN to
> >> SEARCH
> >> >> in SqlToRelConverter” being logged.
> >> >>
> >> >> When I introduced SEARCH I got a lot of complaints about how I was
> >> making
> >> >> plans more complicated and generally breaking things. All I was really
> >> >> doing was bringing the concept of a Sarg [1] into our logical algebra.
> >> I’m
> >> >> glad that people now see that SEARCH leads to more robust, efficient
> >> plans.
> >> >>
> >> >> Julian
> >> >>
> >> >> [1] https://en.wikipedia.org/wiki/Sargable
> >>
> >>

Re: SEARCH and Sargs

Reply via email to