Hello Andrei! I've also bumped into this earlier - from a little bit different angle :)
I've found some "back story about it": It works currently like this because a long time ago there was no other way to do the subquery conversion of IN; I think now the preferred way would be to have a rule to do that conversion... About (3) leaving INs as is: I've experimented with that earlier and it seems like there are some places where IN is handled only as a subquery. If we don't open INs; then there is a chance that some expression simplification will not happen (because it's not handled there either). For Hive we've decided to follow the Calcite way for now; so Hive also opens ORs before it calls Calcite - but there's also a rule which tries to close ORs into an IN. https://github.com/apache/hive/blob/236a32c645a21b04ccaf7f18db5c6a5aa53586e8/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HivePointLookupOptimizerRule.java#L217 I plan to get back to this; or at least get CALCITE-2444. cheers, Zoltan On 08/14/2018 07:38 AM, Andrei Sereda wrote:
Hello, I have noticed that our queries get converted into full table scans when they contain IN predicates with long list of elements (eg. id IN ($manyIds) ). Upon some debuging the reason seems to be SqlToRelConverter.substituteSubQuery() <https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/sql2rel/SqlToRelConverter.java#L1001>. Calcite will automatically translate subqueries into joins when DEFAULT_IN_SUB_QUERY_THRESHOLD <https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/sql2rel/SqlToRelConverter.java#L221> is reached (currently 20 elements). We have instances when $manyIds can have 10K of elements (out of 50M rows table). Some questions: 1. In order to skip this query translation seems like I have to change SqlToRelConverter.Config.InSubQueryThreshold to MAX_VALUE. How can one do it in CalciteConnection or PreparedStatement ? Is is the only way ? 2. Is SubQueryConverter <https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/sql2rel/SubQueryConverter.java> of any use in this particular case ? 3. Can one keep IN predicate "as is" without converting to disjunctions (see convertInToOr <https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/sql2rel/SqlToRelConverter.java#L1396>) ? The reason I’m asking is because our elastic queries get pretty big if constructucted using ORs (multiple nested JSONs for each element vs simple JSON array). Example to reproduce (for Mongo Adapter) @Testpublic void subQueryTheshold() { // SqlToRelConverter.DEFAULT_IN_SUB_QUERY_THRESHOLD = 20 // just a long list of ids (>20 to trigger query rewrite) final String ids = IntStream.range(0, SqlToRelConverter.DEFAULT_IN_SUB_QUERY_THRESHOLD) .mapToObj(i -> String.format("'%d'", i)) .collect(Collectors.joining(",")); assertModel(MODEL) .query( "select * from zips where state in (" + ids + ")") .explainContains("MongoFilter") .returns(""); } As usual, thanks a lot for your help. Andrei.
