Re: SEARCH and Sargs

Julian Hyde Sat, 18 Oct 2025 12:20:38 -0700

My position is that during planning, IN and BETWEEN should not exist. All such 
ranges should be represented by SEARCH.


In 7232 Stamatis points out that it would be useful to have an IN RexCall if 
we’re just about to generate an IN clause for some particular database. I agree 
with that position, but we have to find a practical and efficient way to ban IN 
from the planning process.

Julian


> On Oct 16, 2025, at 8:02 AM, Stamatis Zampetakis <[email protected]> wrote:
> 
> The discussion became more about having IN in a RexCall so I logged
> CALCITE-7232 [1] sharing also some background context about challenges
> in Apache Hive. We can continue the discussion there and see if we can
> reach consensus.
> 
> [1] https://issues.apache.org/jira/browse/CALCITE-7232
> 
> On Wed, Oct 15, 2025 at 5:55 PM Stephen Carlin
> <[email protected]> wrote:
>> 
>> This will perhaps be a reiteration of what I said in my previous email, but
>> here are the major highlights, I think
>> 
>> - SEARCH is something that makes rules more efficient. A very nice
>> structure indeed and very handy.
>> - I'm guessing the RexUtil.expandSearch() method was probably created
>> because many databases cannot deal with SEARCH directly.  So it gets
>> expanded to existing RexCall operators to standard SQL operations that all
>> databases understand.  I hope I am getting this right.  A very convenient
>> and nice method for the databases that don't implement SEARCH, I imagine.
>> - The expandSearch method can really only have one type of expansion
>> (config parameters notwithstanding).  So while many SQL functions can map
>> into 1 SEARCH operator, only one of those creations can be chosen when
>> expanding.
>> - IN is the only cases I am aware of (are there others?) where, at
>> SqlToRelConverter time, the O(1) function concept is just thrown away.  It
>> either creates a subquery RelNode tree, losing the concept of a pure
>> function or it changes it into an inefficient O(m) function with multiple
>> ORs.  *This is actually my main issue*, not the SEARCH operator.
>> - I would imagine pretty much all databases understand a concept of an IN
>> function as it is part of the SQL standard, I think?  Yes, IN is
>> overloaded, but this specialized use of it is very common, common enough
>> whereI would imagine the non-SEARCH databases prolly implement this as a
>> function with an O(1) search time.
>> 
>> All these points together point to the missing functionality for me.  While
>> CALCITE-7226 slightly addresses this issue, it really doesn't solve the
>> main problem for me in that the expandSearch generates inefficient SQL for
>> my database.  The RexUtil.expandSearch method is essentially unusable and I
>> have to resort to writing my own expandSearch.  Or I have to write my own
>> "OR -> IN' detection and convert it into my custom IN operator.
>> 
>> On Wed, Oct 15, 2025 at 6:11 AM Stephen Carlin <[email protected]> wrote:
>> 
>>> What you said makes perfect sense to me about restricting the functions.
>>> Definitely think it was a good idea to create it and implement it.
>>> 
>>> But I do think you are slightly off about where the tension lies.  I think
>>> the tension is on a very specific issue for the most part, at least for me.
>>> 
>>> If I'm understanding what you are saying about the tension correctly: I
>>> think your NVL example is there to show that you can't really guess which
>>> way to do the expand search?  And this is true.  However, the difference
>>> between NVL and a case statement is prolly not something I care much
>>> about.  There are only 2 parameters in an NVL function.  If you give me a
>>> CASE when I originally had an NVL  my first reaction would be, "Where the
>>> heck did that come from?" followed by  "meh, they do the same thing and the
>>> efficiency isn't much different, the end user can figure out that they're
>>> the same thing, I'll get around to fixing that later, if ever."
>>> 
>>> But the IN case is far, far different.  It is expanded as  "A = 1 OR A = 2
>>> OR A =3 OR ...." and 100 more ORs which is a huge headache.  100 OR
>>> statements is quite the nightmare compared to 1 IN statement.  I really
>>> need that expansion to an IN operator.
>>> 
>>> Not sure how many operators fit the IN model as opposed to the NVL model,
>>> but I'd imagine it's not a lot.  I'm guessing the main complaints came
>>> about IN.
>>> 
>>> On Tue, Oct 14, 2025 at 4:17 PM Julian Hyde <[email protected]>
>>> wrote:
>>> 
>>>> SEARCH is a semi-join too. It’s just a semi-join to an infinite set (the
>>>> points on the number line, or any totally-ordered type). :)
>>>> 
>>>> I agree that it’s useful to have multiple functions, and mappings between
>>>> them. We certainly allow multiple functions in the SQL we consume, and the
>>>> SQL we produce (via the JDBC adapter). And we map between them by
>>>> desugaring (e.g. "a BETWEEN b AND c" ==> "a >= b AND a <= c”.)
>>>> 
>>>> It is very advantageous to restrict the set of functions that can be used
>>>> in planner rules and simplifications. Since SEARCH can represent BETWEEN
>>>> (on a constant range) and >, the argument goes, let’s convert them both to
>>>> SEARCH, and our rules can be simpler.
>>>> 
>>>> I am a fan of this approach. It is much harder to produce smart,
>>>> consistent, robust planning rules if we let the surface area get too large.
>>>> 
>>>> This approach is at tension with people who want and expect their
>>>> operators to pass through planning unchanged. If someone wrote "NVL(a, b)"
>>>> they are surprised if it comes out as "CASE WHEN a IS NULL THEN b ELSE a”.
>>>> I guess we can mitigate by re-sugaring on the backend, before we generate
>>>> SQL.
>>>> 
>>>> Julian
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On Oct 14, 2025, at 1:51 PM, Stephen Carlin
>>>> <[email protected]> wrote:
>>>>> 
>>>>> Heh, it's funny you say that because Stamatis filed this improvement
>>>>> request in a meeting where I was complaining about things being more
>>>>> complicated and having broken code because of SEARCH :)
>>>>> 
>>>>> My main complaint is this, and perhaps it makes sense to file another
>>>> Jira
>>>>> for a potential new improvement:
>>>>> 
>>>>> I totally get SEARCH and its usefulness.  So I have no complaints about
>>>> the
>>>>> existence of SEARCH, why it is needed, and why it makes sense.
>>>>> 
>>>>> My problem with SEARCH is simply that it makes my code more complex.  I
>>>>> kinda think of it this way:  The SEARCH operator is a superset of the
>>>>> BETWEEN operator.  And we either are ok with having a representation for
>>>>> both or we are stuck with it due to legacy reasons.  I prefer the former
>>>>> explanation, heh.  So why not have an IN_FUNC operator as well?  No
>>>> need to
>>>>> get rid of SEARCH!  Just keep it as a superset.  And then create a
>>>> RexUtil
>>>>> method that does the work for me that converts a relevant SEARCH to an
>>>>> IN_FUNC.
>>>>> 
>>>>> Perhaps the general issue with IN is this: Like SEARCH, IN also has a
>>>>> broader definition than just a function, but the broader definition is
>>>>> different from SEARCH.  In our internal discussion, we talked about how
>>>>> "IN" is always a semi-join of some nature, sometimes of course with a
>>>>> subquery.  I believe this is how Calcite treats IN.  But in practice, IN
>>>>> has a specialized case as something that fits right into what a
>>>>> RexCall/function does.  I think many databases (at least the two I work
>>>> on)
>>>>> implement this specialized case of IN as a function internally, just
>>>> like
>>>>> BETWEEN.  So to have a 1:1 RexCall operator to function for IN just
>>>> seems
>>>>> to make intuitive sense for me.
>>>>> 
>>>>> Perhaps we don't have a problem with the existence of a BETWEEN operator
>>>>> stems from the fact that BETWEEN doesn't have the same type of overload
>>>>> that IN does.
>>>>> 
>>>>> So yeah though: If we don't think that an IN type operator is good for
>>>>> Calcite, I need to develop code local to my database that does a
>>>> conversion
>>>>> to a function which differs from my normal RexCall conversion.   And my
>>>>> intuition tells me that this effort has already been done elsewhere.
>>>> Which
>>>>> is why I think it belongs in Calcite, where the community would benefit.
>>>>> 
>>>>> Anyway, having said all this:  the improvement in CALCITE-7226 will
>>>>> definitely help me out because creating all those ORs before generating
>>>> a
>>>>> SEARCH seems quite inefficient. It might even lead to problems in my
>>>> code
>>>>> as the multiple ORs generate many layers of RexNodes which I think
>>>> created
>>>>> some really slow compilations for me during one of my tests.
>>>>> 
>>>>> On Tue, Oct 14, 2025 at 11:30 AM Julian Hyde <[email protected]>
>>>> wrote:
>>>>> 
>>>>>> It’s nice to see issues like
>>>>>> https://issues.apache.org/jira/browse/CALCITE-7226 "Convert IN to
>>>> SEARCH
>>>>>> in SqlToRelConverter” being logged.
>>>>>> 
>>>>>> When I introduced SEARCH I got a lot of complaints about how I was
>>>> making
>>>>>> plans more complicated and generally breaking things. All I was really
>>>>>> doing was bringing the concept of a Sarg [1] into our logical algebra.
>>>> I’m
>>>>>> glad that people now see that SEARCH leads to more robust, efficient
>>>> plans.
>>>>>> 
>>>>>> Julian
>>>>>> 
>>>>>> [1] https://en.wikipedia.org/wiki/Sargable
>>>> 
>>>>

Re: SEARCH and Sargs

Reply via email to