[jira] [Commented] (CALCITE-6467) Performance of RelMdUtil.checkInputForCollationAndLimit when using `where col in (large literal set)`

Gonzalo Ortiz (Jira) Tue, 25 Feb 2025 02:50:52 -0800


    [ 
https://issues.apache.org/jira/browse/CALCITE-6467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17930267#comment-17930267
 ]


Gonzalo Ortiz commented on CALCITE-6467:
----------------------------------------

Hi there,

I'm sorry for reviving this thread and for not answering before. In Apache
Pinot, we have been trying to apply different patches to solve the problems
that originated from https://github.com/apache/calcite/pull/2743, but we
haven't had any luck yet.

Yes, we are setting this threshold to Integer.MAX_VALUE because we have
several optimizations that cannot be applied when expressing an OR as a
subquery. At the same time, the vast majority of the large INs used by our
users are just a list of literals. We really don't want to apply any kind
of optimization to that list, either at Calcite's SQL node to RelNode or at
RelNode optimization. We understand that this conversion could be desired
in some cases, but it should be optional. As far as we know, there is no
way to disable it.

We don't completely understand why our only options are transforming IN
expressions into a list of ORs or a subquery. Shouldn't we be able to keep
the IN expression (or others) as they are? From our perspective, this lack
of choice is a bad design. We had similar problems in the past when some
expressions in SqlNode were simplified at the same time nodes were
converted into RelNode and the only solution we found was to apply a
visitor that applies our own checks (see
https://github.com/apache/pinot/blob/master/pinot-query-planner/src/main/java/org/apache/pinot/query/validate/BytesCastVisitor.java)
after SqlNode validation but before converting them to RelNode because
there was no way to customize the opinionated logic applied for castings.


The solutions we have tried or thinking about:
1. Reduce the threshold to 0 and apply some RelNode rules to revert the
change later. We are not sure if all possible cases can be detected or
reverted
2. Substitute the IN function in Calcite's function registry for a new
function whose kind should be different from IN/NOT_IN so we don't detect
it as a subquery. This can obviously have side effects elsewhere.
3. Create our own fork where SqlToRelConverter.findSubQueries doesn't
detect IN without subqueries as a subquery. This may also have side
effects. TBH we don't want to have our own Calcite fork, so we would prefer
to contribute that code into the main repo.

It is shocking that Calcite, which can be tuned as much as we want when
applying RelNode rules, is so monolithic and opinionated when applying
_optimizations_ while the query is still at SqlNode. Is there any reason
for that? Is Pinot the only project affected by that?



El vie, 12 jul 2024 a las 17:19, Julian Hyde (Jira) (<[email protected]>)



> Performance of RelMdUtil.checkInputForCollationAndLimit when using `where col 
> in (large literal set)`
> -----------------------------------------------------------------------------------------------------
>
>                 Key: CALCITE-6467
>                 URL: https://issues.apache.org/jira/browse/CALCITE-6467
>             Project: Calcite
>          Issue Type: Improvement
>    Affects Versions: 1.32.0, 1.33.0, 1.34.0, 1.35.0, 1.36.0, 1.37.0, 1.38.0
>            Reporter: Gonzalo Ortiz
>            Priority: Major
>         Attachments: MultistageEngineQuickStart_2024_07_12_111558.jfr, 
> image-2024-07-12-15-58-33-504.png
>
>
> Recently we have updated Pinot to use Calcite 1.37. Previously we were using 
> 1.31.
> After the upgrade, we have found issues when executing some queries that 
> include large IN clauses. Queries like:
> {code:java}
> explain plan for
> SELECT DestCityName
> FROM (
>          SELECT DestCityName
>          FROM airlineStats
>          WHERE DestCityName IN (
>                         'a1', 'a2', 'a3', ... 'a300'
>              )
>          GROUP BY DestCityName
>      ) as a
> {code}
> After some debug, we have found that the issue is in one of our custom rules (
> PinotSortExchangeCopyRule) when we call 
> `RelMdUtil.checkInputForCollationAndLimit` 
> ([link|https://github.com/apache/pinot/blob/dacc6d06907c44e83721454f1090e5f00c824f15/pinot-query-planner/src/main/java/org/apache/pinot/calcite/rel/rules/PinotSortExchangeCopyRule.java#L64]).
>  
> Comparing different Pinot versions, I've found out that the change that is 
> causing problems in our scenario is #CALCITE-5036 (see
>  
> [https://github.com/apache/calcite/pull/2743)|https://github.com/apache/calcite/pull/2743/files).].
>  Specifically, it looks like the problem appears when `RelMdPredicates` tries 
> to simplify the expression, which at the time is basically an 
> `OR(DestCityName = a1, DestCityName = a2, ...)`. 
> `RexSimplify.simplifyOrTerms` negates previous terms in order to apply 
> possible optimizations, but in cases like this where we have hundreds of 
> literals, that is very expensive. Going more in detail, it looks like most of 
> the time is invested in creating new ranges:
> !image-2024-07-12-15-58-33-504.png!
>  
> You may find more insights in the attached JFR file
> [^MultistageEngineQuickStart_2024_07_12_111558.jfr][^MultistageEngineQuickStart_2024_07_12_111558.jfr]
>  
> I've tried to reproduce the problem with `sqline` but I wasn't able to do so. 
> As far as I can see in the code, RelMdUtil.checkInputForCollationAndLimit is 
> only called in SortJoinCopyRule, SortJoinTransposeRule and 
> SortUnionTransposeRule. I've tried to create a test or JMH benchmark in 
> Calcite to try to reproduce the issue, but I don't know codebase well enough.
>  
> I don't consider myself an expert on Apache Calcite and I know we are not 
> using Calcite in the most standard way (we are slowly migrating from our own 
> engine to Calcite), but I'm pretty confident this issue may also affect other 
> Calcite usages. At least in the trace I cannot see anything Pinot specific.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (CALCITE-6467) Performance of RelMdUtil.checkInputForCollationAndLimit when using `where col in (large literal set)`

Reply via email to