Re: SEARCH and Sargs

Stephen Carlin Thu, 16 Oct 2025 13:15:10 -0700

I think we're all in agreement then!  I'm all for banning IN and BETWEEN
during planning.  Not sure if your position is this extreme, but if we can
ban "=", ">", "<", and others, I wouldn't say no (though I wouldn't demand
it either).


Gonna spitball an idea off the top of my head that might be too big to
develop.  But it sounds like there's a concept of internal planning
operators and external database operators.  While plenty of these overlap,
there seems to be a good reason to separate the two concepts.  Maybe it
could be as simple as having certain operators implement a dummy interface
that could be cast?  Idk, just thinking out loud.

On Thu, Oct 16, 2025 at 11:02 AM Julian Hyde <[email protected]> wrote:

> My position is that during planning, IN and BETWEEN should not exist. All
> such ranges should be represented by SEARCH.
>
> In 7232 Stamatis points out that it would be useful to have an IN RexCall
> if we’re just about to generate an IN clause for some particular database.
> I agree with that position, but we have to find a practical and efficient
> way to ban IN from the planning process.
>
> Julian
>
>
> > On Oct 16, 2025, at 8:02 AM, Stamatis Zampetakis <[email protected]>
> wrote:
> >
> > The discussion became more about having IN in a RexCall so I logged
> > CALCITE-7232 [1] sharing also some background context about challenges
> > in Apache Hive. We can continue the discussion there and see if we can
> > reach consensus.
> >
> > [1] https://issues.apache.org/jira/browse/CALCITE-7232
> >
> > On Wed, Oct 15, 2025 at 5:55 PM Stephen Carlin
> > <[email protected]> wrote:
> >>
> >> This will perhaps be a reiteration of what I said in my previous email,
> but
> >> here are the major highlights, I think
> >>
> >> - SEARCH is something that makes rules more efficient. A very nice
> >> structure indeed and very handy.
> >> - I'm guessing the RexUtil.expandSearch() method was probably created
> >> because many databases cannot deal with SEARCH directly.  So it gets
> >> expanded to existing RexCall operators to standard SQL operations that
> all
> >> databases understand.  I hope I am getting this right.  A very
> convenient
> >> and nice method for the databases that don't implement SEARCH, I
> imagine.
> >> - The expandSearch method can really only have one type of expansion
> >> (config parameters notwithstanding).  So while many SQL functions can
> map
> >> into 1 SEARCH operator, only one of those creations can be chosen when
> >> expanding.
> >> - IN is the only cases I am aware of (are there others?) where, at
> >> SqlToRelConverter time, the O(1) function concept is just thrown away.
> It
> >> either creates a subquery RelNode tree, losing the concept of a pure
> >> function or it changes it into an inefficient O(m) function with
> multiple
> >> ORs.  *This is actually my main issue*, not the SEARCH operator.
> >> - I would imagine pretty much all databases understand a concept of an
> IN
> >> function as it is part of the SQL standard, I think?  Yes, IN is
> >> overloaded, but this specialized use of it is very common, common enough
> >> whereI would imagine the non-SEARCH databases prolly implement this as a
> >> function with an O(1) search time.
> >>
> >> All these points together point to the missing functionality for me.
> While
> >> CALCITE-7226 slightly addresses this issue, it really doesn't solve the
> >> main problem for me in that the expandSearch generates inefficient SQL
> for
> >> my database.  The RexUtil.expandSearch method is essentially unusable
> and I
> >> have to resort to writing my own expandSearch.  Or I have to write my
> own
> >> "OR -> IN' detection and convert it into my custom IN operator.
> >>
> >> On Wed, Oct 15, 2025 at 6:11 AM Stephen Carlin <[email protected]>
> wrote:
> >>
> >>> What you said makes perfect sense to me about restricting the
> functions.
> >>> Definitely think it was a good idea to create it and implement it.
> >>>
> >>> But I do think you are slightly off about where the tension lies.  I
> think
> >>> the tension is on a very specific issue for the most part, at least
> for me.
> >>>
> >>> If I'm understanding what you are saying about the tension correctly: I
> >>> think your NVL example is there to show that you can't really guess
> which
> >>> way to do the expand search?  And this is true.  However, the
> difference
> >>> between NVL and a case statement is prolly not something I care much
> >>> about.  There are only 2 parameters in an NVL function.  If you give
> me a
> >>> CASE when I originally had an NVL  my first reaction would be, "Where
> the
> >>> heck did that come from?" followed by  "meh, they do the same thing
> and the
> >>> efficiency isn't much different, the end user can figure out that
> they're
> >>> the same thing, I'll get around to fixing that later, if ever."
> >>>
> >>> But the IN case is far, far different.  It is expanded as  "A = 1 OR A
> = 2
> >>> OR A =3 OR ...." and 100 more ORs which is a huge headache.  100 OR
> >>> statements is quite the nightmare compared to 1 IN statement.  I really
> >>> need that expansion to an IN operator.
> >>>
> >>> Not sure how many operators fit the IN model as opposed to the NVL
> model,
> >>> but I'd imagine it's not a lot.  I'm guessing the main complaints came
> >>> about IN.
> >>>
> >>> On Tue, Oct 14, 2025 at 4:17 PM Julian Hyde <[email protected]>
> >>> wrote:
> >>>
> >>>> SEARCH is a semi-join too. It’s just a semi-join to an infinite set
> (the
> >>>> points on the number line, or any totally-ordered type). :)
> >>>>
> >>>> I agree that it’s useful to have multiple functions, and mappings
> between
> >>>> them. We certainly allow multiple functions in the SQL we consume,
> and the
> >>>> SQL we produce (via the JDBC adapter). And we map between them by
> >>>> desugaring (e.g. "a BETWEEN b AND c" ==> "a >= b AND a <= c”.)
> >>>>
> >>>> It is very advantageous to restrict the set of functions that can be
> used
> >>>> in planner rules and simplifications. Since SEARCH can represent
> BETWEEN
> >>>> (on a constant range) and >, the argument goes, let’s convert them
> both to
> >>>> SEARCH, and our rules can be simpler.
> >>>>
> >>>> I am a fan of this approach. It is much harder to produce smart,
> >>>> consistent, robust planning rules if we let the surface area get too
> large.
> >>>>
> >>>> This approach is at tension with people who want and expect their
> >>>> operators to pass through planning unchanged. If someone wrote
> "NVL(a, b)"
> >>>> they are surprised if it comes out as "CASE WHEN a IS NULL THEN b
> ELSE a”.
> >>>> I guess we can mitigate by re-sugaring on the backend, before we
> generate
> >>>> SQL.
> >>>>
> >>>> Julian
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>> On Oct 14, 2025, at 1:51 PM, Stephen Carlin
> >>>> <[email protected]> wrote:
> >>>>>
> >>>>> Heh, it's funny you say that because Stamatis filed this improvement
> >>>>> request in a meeting where I was complaining about things being more
> >>>>> complicated and having broken code because of SEARCH :)
> >>>>>
> >>>>> My main complaint is this, and perhaps it makes sense to file another
> >>>> Jira
> >>>>> for a potential new improvement:
> >>>>>
> >>>>> I totally get SEARCH and its usefulness.  So I have no complaints
> about
> >>>> the
> >>>>> existence of SEARCH, why it is needed, and why it makes sense.
> >>>>>
> >>>>> My problem with SEARCH is simply that it makes my code more
> complex.  I
> >>>>> kinda think of it this way:  The SEARCH operator is a superset of the
> >>>>> BETWEEN operator.  And we either are ok with having a representation
> for
> >>>>> both or we are stuck with it due to legacy reasons.  I prefer the
> former
> >>>>> explanation, heh.  So why not have an IN_FUNC operator as well?  No
> >>>> need to
> >>>>> get rid of SEARCH!  Just keep it as a superset.  And then create a
> >>>> RexUtil
> >>>>> method that does the work for me that converts a relevant SEARCH to
> an
> >>>>> IN_FUNC.
> >>>>>
> >>>>> Perhaps the general issue with IN is this: Like SEARCH, IN also has a
> >>>>> broader definition than just a function, but the broader definition
> is
> >>>>> different from SEARCH.  In our internal discussion, we talked about
> how
> >>>>> "IN" is always a semi-join of some nature, sometimes of course with a
> >>>>> subquery.  I believe this is how Calcite treats IN.  But in
> practice, IN
> >>>>> has a specialized case as something that fits right into what a
> >>>>> RexCall/function does.  I think many databases (at least the two I
> work
> >>>> on)
> >>>>> implement this specialized case of IN as a function internally, just
> >>>> like
> >>>>> BETWEEN.  So to have a 1:1 RexCall operator to function for IN just
> >>>> seems
> >>>>> to make intuitive sense for me.
> >>>>>
> >>>>> Perhaps we don't have a problem with the existence of a BETWEEN
> operator
> >>>>> stems from the fact that BETWEEN doesn't have the same type of
> overload
> >>>>> that IN does.
> >>>>>
> >>>>> So yeah though: If we don't think that an IN type operator is good
> for
> >>>>> Calcite, I need to develop code local to my database that does a
> >>>> conversion
> >>>>> to a function which differs from my normal RexCall conversion.   And
> my
> >>>>> intuition tells me that this effort has already been done elsewhere.
> >>>> Which
> >>>>> is why I think it belongs in Calcite, where the community would
> benefit.
> >>>>>
> >>>>> Anyway, having said all this:  the improvement in CALCITE-7226 will
> >>>>> definitely help me out because creating all those ORs before
> generating
> >>>> a
> >>>>> SEARCH seems quite inefficient. It might even lead to problems in my
> >>>> code
> >>>>> as the multiple ORs generate many layers of RexNodes which I think
> >>>> created
> >>>>> some really slow compilations for me during one of my tests.
> >>>>>
> >>>>> On Tue, Oct 14, 2025 at 11:30 AM Julian Hyde <[email protected]
> >
> >>>> wrote:
> >>>>>
> >>>>>> It’s nice to see issues like
> >>>>>> https://issues.apache.org/jira/browse/CALCITE-7226 "Convert IN to
> >>>> SEARCH
> >>>>>> in SqlToRelConverter” being logged.
> >>>>>>
> >>>>>> When I introduced SEARCH I got a lot of complaints about how I was
> >>>> making
> >>>>>> plans more complicated and generally breaking things. All I was
> really
> >>>>>> doing was bringing the concept of a Sarg [1] into our logical
> algebra.
> >>>> I’m
> >>>>>> glad that people now see that SEARCH leads to more robust, efficient
> >>>> plans.
> >>>>>>
> >>>>>> Julian
> >>>>>>
> >>>>>> [1] https://en.wikipedia.org/wiki/Sargable
> >>>>
> >>>>
>
>

Re: SEARCH and Sargs

Reply via email to