suneet-s commented on a change in pull request #12195:
URL: https://github.com/apache/druid/pull/12195#discussion_r800227896
##########
File path: docs/querying/query-context.md
##########
@@ -62,7 +62,7 @@ Unless otherwise noted, the following parameters apply to all
query types.
|secondaryPartitionPruning|`true`|Enable secondary partition pruning on the
Broker. The Broker will always prune unnecessary segments from the input scan
based on a filter on time intervals, but if the data is further partitioned
with hash or range partitioning, this option will enable additional pruning
based on a filter on secondary partition dimensions.|
|enableJoinLeftTableScanDirect|`false`|This flag applies to queries which have
joins. For joins, where left child is a simple scan with a filter, by default,
druid will run the scan as a query and the join the results to the right child
on broker. Setting this flag to true overrides that behavior and druid will
attempt to push the join to data servers instead. Please note that the flag
could be applicable to queries even if there is no explicit join. since queries
can internally translated into a join by the SQL planner.|
|debug| `false` | Flag indicating whether to enable debugging outputs for the
query. When set to false, no additional logs will be produced (logs produced
will be entirely dependent on your logging level). When set to true, the
following addition logs will be produced:<br />- Log the stack trace of the
exception (if any) produced by the query |
-
+|maxNumericInFilters|`-1`|Max limit for the amount of numeric values that can
be compared for a string type dimension when the entire SQL WHERE clause of a
query translates to an [IN filter](../querying/filters.md#in-filter). By
default, Druid does not restrict the amount of numbers in an IN filter,
although this situation may block other queries from running. Set this property
to a smaller value to prevent Druid from running queries that have
prohibitively long segment processing times. The optimal limit requires some
trial and error; we recommend starting with 100. Users who submit a query that
exceeds the limit of `maxNumericInFilters` should instead rewrite their queries
to use strings in the IN filter instead of numbers. For example, `WHERE
someString IN (‘123’, ‘456’)`. This value cannot exceed the set system
configuration `druid.sql.planner.maxNumericInFilters`. This value is ignored if
`druid.sql.planner.maxNumericInFilters` is not set explicitly.|
Review comment:
Docs should include a note of how this field interacts with the system
property `maxNumericInFilters`
##########
File path:
sql/src/main/java/org/apache/druid/sql/calcite/planner/PlannerConfig.java
##########
@@ -190,6 +210,46 @@ public PlannerConfig withOverrides(final Map<String,
Object> context)
return newConfig;
}
+ private int validateMaxNumericInFilters(int queryContextMaxNumericInFilters,
int systemConfigMaxNumericInFilters)
+ {
+ // if maxNumericInFIlters through context == 0 catch exception
+ // else if query context exceeds system set value throw error
+ if (queryContextMaxNumericInFilters == 0) {
+ throw new UOE("[%s] must be greater than 0", CTX_MAX_NUMERIC_IN_FILTERS);
+ } else if (queryContextMaxNumericInFilters >
systemConfigMaxNumericInFilters
+ && systemConfigMaxNumericInFilters != NUM_FILTER_NOT_USED) {
+ throw new UOE(
+ "Expected parameter[%s] cannot exceed system set value of [%d]",
+ CTX_MAX_NUMERIC_IN_FILTERS,
+ systemConfigMaxNumericInFilters
+ );
+ }
+ // if system set value is not present, thereby inferring default of -1
+ if (systemConfigMaxNumericInFilters == NUM_FILTER_NOT_USED) {
+ return systemConfigMaxNumericInFilters;
Review comment:
I think you want this to return the queryContextMaxNumericInFilters in
this case. The reason I think this is a desired behavior is because it allows a
sys admin to test what a reasonable default would be for the cluster by using
the query context. Once the sys admin sets the maxNumericInFilters to some
value, then having the system config be an upper limit makes sense. This should
be documented in `configuration/index.md` and `querying/query-context.md`
```suggestion
return queryContextMaxNumericInFilters;
```
##########
File path: sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java
##########
@@ -6903,6 +6904,67 @@ public void testMaxSubqueryRows() throws Exception
);
}
+ @Test
+ public void testZeroMaxNumericInFilter() throws Exception
+ {
+ expectedException.expect(UOE.class);
+ expectedException.expectMessage("[maxNumericInFilters] must be greater
than 0");
+
+ testQuery(
+ PLANNER_CONFIG_DEFAULT,
+ ImmutableMap.of(QueryContexts.MAX_NUMERIC_IN_FILTERS, 0),
Review comment:
Can we add a test for a non 0 query context with the planner config
default to address my comment in PlannerConfig
##########
File path: sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java
##########
@@ -6903,6 +6904,67 @@ public void testMaxSubqueryRows() throws Exception
);
}
+ @Test
+ public void testZeroMaxNumericInFilter() throws Exception
+ {
+ expectedException.expect(UOE.class);
+ expectedException.expectMessage("[maxNumericInFilters] must be greater
than 0");
+
+ testQuery(
+ PLANNER_CONFIG_DEFAULT,
+ ImmutableMap.of(QueryContexts.MAX_NUMERIC_IN_FILTERS, 0),
+ "SELECT COUNT(*)\n"
+ + "FROM druid.numfoo\n"
+ + "WHERE dim6 IN (\n"
+ + "1,2,3\n"
+ + ")\n",
+ CalciteTests.REGULAR_USER_AUTH_RESULT,
+ ImmutableList.of(),
+ ImmutableList.of()
+ );
+ }
+
+ @Test
+ public void testHighestMaxNumericInFilter() throws Exception
+ {
+ expectedException.expect(UOE.class);
+ expectedException.expectMessage("Expected parameter[maxNumericInFilters]
cannot exceed system set value of [100]");
+
+ testQuery(
+ PLANNER_CONFIG_MAX_NUMERIC_IN_FILTER,
+ ImmutableMap.of(QueryContexts.MAX_NUMERIC_IN_FILTERS, 20000),
+ "SELECT COUNT(*)\n"
+ + "FROM druid.numfoo\n"
+ + "WHERE dim6 IN (\n"
+ + "1,2,3\n"
+ + ")\n",
+ CalciteTests.REGULAR_USER_AUTH_RESULT,
+ ImmutableList.of(),
+ ImmutableList.of()
+ );
+ }
+
+ @Test
+ public void testQueryWithMoreThanMaxNumericInFilter() throws Exception
+ {
+ expectedException.expect(UOE.class);
+ expectedException.expectMessage("The number of values in the IN clause for
[dim6] in query exceeds configured maxNumericFilter limit of [2] for INs. Cast
[3] values of IN clause to String");
+
+ testQuery(
+ PLANNER_CONFIG_MAX_NUMERIC_IN_FILTER,
+ ImmutableMap.of(QueryContexts.MAX_NUMERIC_IN_FILTERS, 2),
+ "SELECT COUNT(*)\n"
+ + "FROM druid.numfoo\n"
+ + "WHERE dim6 IN (\n"
Review comment:
I think we'll want some additional tests here for more complex filters -
Some examples:
* ORs of multiple dimensions eg. something like dim1 = 1 OR dim2 =2 or dim3
= 3
* OR filter several levels deep eg. dim1='one' AND dim2 IN (1,2,3,4,5)
* NOT IN filter eg. `dim6 NOT IN (...)`
##########
File path:
sql/src/main/java/org/apache/druid/sql/calcite/run/NativeQueryMaker.java
##########
@@ -122,6 +127,32 @@ public boolean feature(QueryFeature feature)
);
}
}
+ int numFilters =
plannerContext.getPlannerConfig().getMaxNumericInFilters();
+
+ // special corner case handling for numeric IN filters
+ // in case of query containing IN (v1, v2, v3,...) where Vi is numeric
+ // a BoundFilter is created internally for each of the values
+ // whereas when Vi s are String the Filters are converted as BoundFilter
to SelectorFilter to InFilter
+ // which takes lesser processing for bitmaps
+ // So in a case where user executes a query with multiple numeric INs,
flame graph shows BoundFilter.getBitmapResult
+ // and BoundFilter.match predicate eating up processing time which stalls
a historical for a query with large number
+ // of numeric INs (> 10K). In such cases user should change the query to
specify the IN clauses as String
+ // Instead of IN(v1,v2,v3) user should specify IN('v1','v2','v3')
+ if (numFilters != PlannerConfig.NUM_FILTER_NOT_USED) {
+ if (query.getFilter() instanceof OrDimFilter) {
Review comment:
This looks like it would only supports filtering on a top level OR
filter. However if a filter on a query is nested, then this guardrail would not
be effective. For example what if the filter was
```
dim1='one' AND dim2 IN (1,2,3,4,5)
```
In this case, it looks like the query would still run.
Do you think walking the entire filter tree would be expensive at this point
in the query plan? The other thing I'd like to know more about is how
`DimFilter#optimize` plays into this guardrail - is the query filter that is
being used here before or after the optimize function is called?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]