The discussion became more about having IN in a RexCall so I logged CALCITE-7232 [1] sharing also some background context about challenges in Apache Hive. We can continue the discussion there and see if we can reach consensus.
[1] https://issues.apache.org/jira/browse/CALCITE-7232 On Wed, Oct 15, 2025 at 5:55 PM Stephen Carlin <[email protected]> wrote: > > This will perhaps be a reiteration of what I said in my previous email, but > here are the major highlights, I think > > - SEARCH is something that makes rules more efficient. A very nice > structure indeed and very handy. > - I'm guessing the RexUtil.expandSearch() method was probably created > because many databases cannot deal with SEARCH directly. So it gets > expanded to existing RexCall operators to standard SQL operations that all > databases understand. I hope I am getting this right. A very convenient > and nice method for the databases that don't implement SEARCH, I imagine. > - The expandSearch method can really only have one type of expansion > (config parameters notwithstanding). So while many SQL functions can map > into 1 SEARCH operator, only one of those creations can be chosen when > expanding. > - IN is the only cases I am aware of (are there others?) where, at > SqlToRelConverter time, the O(1) function concept is just thrown away. It > either creates a subquery RelNode tree, losing the concept of a pure > function or it changes it into an inefficient O(m) function with multiple > ORs. *This is actually my main issue*, not the SEARCH operator. > - I would imagine pretty much all databases understand a concept of an IN > function as it is part of the SQL standard, I think? Yes, IN is > overloaded, but this specialized use of it is very common, common enough > whereI would imagine the non-SEARCH databases prolly implement this as a > function with an O(1) search time. > > All these points together point to the missing functionality for me. While > CALCITE-7226 slightly addresses this issue, it really doesn't solve the > main problem for me in that the expandSearch generates inefficient SQL for > my database. The RexUtil.expandSearch method is essentially unusable and I > have to resort to writing my own expandSearch. Or I have to write my own > "OR -> IN' detection and convert it into my custom IN operator. > > On Wed, Oct 15, 2025 at 6:11 AM Stephen Carlin <[email protected]> wrote: > > > What you said makes perfect sense to me about restricting the functions. > > Definitely think it was a good idea to create it and implement it. > > > > But I do think you are slightly off about where the tension lies. I think > > the tension is on a very specific issue for the most part, at least for me. > > > > If I'm understanding what you are saying about the tension correctly: I > > think your NVL example is there to show that you can't really guess which > > way to do the expand search? And this is true. However, the difference > > between NVL and a case statement is prolly not something I care much > > about. There are only 2 parameters in an NVL function. If you give me a > > CASE when I originally had an NVL my first reaction would be, "Where the > > heck did that come from?" followed by "meh, they do the same thing and the > > efficiency isn't much different, the end user can figure out that they're > > the same thing, I'll get around to fixing that later, if ever." > > > > But the IN case is far, far different. It is expanded as "A = 1 OR A = 2 > > OR A =3 OR ...." and 100 more ORs which is a huge headache. 100 OR > > statements is quite the nightmare compared to 1 IN statement. I really > > need that expansion to an IN operator. > > > > Not sure how many operators fit the IN model as opposed to the NVL model, > > but I'd imagine it's not a lot. I'm guessing the main complaints came > > about IN. > > > > On Tue, Oct 14, 2025 at 4:17 PM Julian Hyde <[email protected]> > > wrote: > > > >> SEARCH is a semi-join too. It’s just a semi-join to an infinite set (the > >> points on the number line, or any totally-ordered type). :) > >> > >> I agree that it’s useful to have multiple functions, and mappings between > >> them. We certainly allow multiple functions in the SQL we consume, and the > >> SQL we produce (via the JDBC adapter). And we map between them by > >> desugaring (e.g. "a BETWEEN b AND c" ==> "a >= b AND a <= c”.) > >> > >> It is very advantageous to restrict the set of functions that can be used > >> in planner rules and simplifications. Since SEARCH can represent BETWEEN > >> (on a constant range) and >, the argument goes, let’s convert them both to > >> SEARCH, and our rules can be simpler. > >> > >> I am a fan of this approach. It is much harder to produce smart, > >> consistent, robust planning rules if we let the surface area get too large. > >> > >> This approach is at tension with people who want and expect their > >> operators to pass through planning unchanged. If someone wrote "NVL(a, b)" > >> they are surprised if it comes out as "CASE WHEN a IS NULL THEN b ELSE a”. > >> I guess we can mitigate by re-sugaring on the backend, before we generate > >> SQL. > >> > >> Julian > >> > >> > >> > >> > >> > On Oct 14, 2025, at 1:51 PM, Stephen Carlin > >> <[email protected]> wrote: > >> > > >> > Heh, it's funny you say that because Stamatis filed this improvement > >> > request in a meeting where I was complaining about things being more > >> > complicated and having broken code because of SEARCH :) > >> > > >> > My main complaint is this, and perhaps it makes sense to file another > >> Jira > >> > for a potential new improvement: > >> > > >> > I totally get SEARCH and its usefulness. So I have no complaints about > >> the > >> > existence of SEARCH, why it is needed, and why it makes sense. > >> > > >> > My problem with SEARCH is simply that it makes my code more complex. I > >> > kinda think of it this way: The SEARCH operator is a superset of the > >> > BETWEEN operator. And we either are ok with having a representation for > >> > both or we are stuck with it due to legacy reasons. I prefer the former > >> > explanation, heh. So why not have an IN_FUNC operator as well? No > >> need to > >> > get rid of SEARCH! Just keep it as a superset. And then create a > >> RexUtil > >> > method that does the work for me that converts a relevant SEARCH to an > >> > IN_FUNC. > >> > > >> > Perhaps the general issue with IN is this: Like SEARCH, IN also has a > >> > broader definition than just a function, but the broader definition is > >> > different from SEARCH. In our internal discussion, we talked about how > >> > "IN" is always a semi-join of some nature, sometimes of course with a > >> > subquery. I believe this is how Calcite treats IN. But in practice, IN > >> > has a specialized case as something that fits right into what a > >> > RexCall/function does. I think many databases (at least the two I work > >> on) > >> > implement this specialized case of IN as a function internally, just > >> like > >> > BETWEEN. So to have a 1:1 RexCall operator to function for IN just > >> seems > >> > to make intuitive sense for me. > >> > > >> > Perhaps we don't have a problem with the existence of a BETWEEN operator > >> > stems from the fact that BETWEEN doesn't have the same type of overload > >> > that IN does. > >> > > >> > So yeah though: If we don't think that an IN type operator is good for > >> > Calcite, I need to develop code local to my database that does a > >> conversion > >> > to a function which differs from my normal RexCall conversion. And my > >> > intuition tells me that this effort has already been done elsewhere. > >> Which > >> > is why I think it belongs in Calcite, where the community would benefit. > >> > > >> > Anyway, having said all this: the improvement in CALCITE-7226 will > >> > definitely help me out because creating all those ORs before generating > >> a > >> > SEARCH seems quite inefficient. It might even lead to problems in my > >> code > >> > as the multiple ORs generate many layers of RexNodes which I think > >> created > >> > some really slow compilations for me during one of my tests. > >> > > >> > On Tue, Oct 14, 2025 at 11:30 AM Julian Hyde <[email protected]> > >> wrote: > >> > > >> >> It’s nice to see issues like > >> >> https://issues.apache.org/jira/browse/CALCITE-7226 "Convert IN to > >> SEARCH > >> >> in SqlToRelConverter” being logged. > >> >> > >> >> When I introduced SEARCH I got a lot of complaints about how I was > >> making > >> >> plans more complicated and generally breaking things. All I was really > >> >> doing was bringing the concept of a Sarg [1] into our logical algebra. > >> I’m > >> >> glad that people now see that SEARCH leads to more robust, efficient > >> plans. > >> >> > >> >> Julian > >> >> > >> >> [1] https://en.wikipedia.org/wiki/Sargable > >> > >>
