[GitHub] [druid] suneet-s commented on a change in pull request #12195: Moving in filter check to broker

GitBox Sun, 06 Feb 2022 13:18:04 -0800


suneet-s commented on a change in pull request #12195:
URL: https://github.com/apache/druid/pull/12195#discussion_r800227896




##########
File path: docs/querying/query-context.md
##########
@@ -62,7 +62,7 @@ Unless otherwise noted, the following parameters apply to all 
query types.
 |secondaryPartitionPruning|`true`|Enable secondary partition pruning on the 
Broker. The Broker will always prune unnecessary segments from the input scan 
based on a filter on time intervals, but if the data is further partitioned 
with hash or range partitioning, this option will enable additional pruning 
based on a filter on secondary partition dimensions.|
 |enableJoinLeftTableScanDirect|`false`|This flag applies to queries which have 
joins. For joins, where left child is a simple scan with a filter,  by default, 
druid will run the scan as a query and the join the results to the right child 
on broker. Setting this flag to true overrides that behavior and druid will 
attempt to push the join to data servers instead. Please note that the flag 
could be applicable to queries even if there is no explicit join. since queries 
can internally translated into a join by the SQL planner.|
 |debug| `false` | Flag indicating whether to enable debugging outputs for the 
query. When set to false, no additional logs will be produced (logs produced 
will be entirely dependent on your logging level). When set to true, the 
following addition logs will be produced:<br />- Log the stack trace of the 
exception (if any) produced by the query |
-
+|maxNumericInFilters|`-1`|Max limit for the amount of numeric values that can 
be compared for a string type dimension when the entire SQL WHERE clause of a 
query translates to an [IN filter](../querying/filters.md#in-filter). By 
default, Druid does not restrict the amount of numbers in an IN filter, 
although this situation may block other queries from running. Set this property 
to a smaller value to prevent Druid from running queries that have 
prohibitively long segment processing times. The optimal limit requires some 
trial and error; we recommend starting with 100.  Users who submit a query that 
exceeds the limit of `maxNumericInFilters` should instead rewrite their queries 
to use strings in the IN filter instead of numbers. For example, `WHERE 
someString IN (‘123’, ‘456’)`. This value cannot exceed the set system 
configuration `druid.sql.planner.maxNumericInFilters`. This value is ignored if 
`druid.sql.planner.maxNumericInFilters` is not set explicitly.|

Review comment:
       Docs should include a note of how this field interacts with the system 
property `maxNumericInFilters`

##########
File path: 
sql/src/main/java/org/apache/druid/sql/calcite/planner/PlannerConfig.java
##########
@@ -190,6 +210,46 @@ public PlannerConfig withOverrides(final Map<String, 
Object> context)
     return newConfig;
   }
 
+  private int validateMaxNumericInFilters(int queryContextMaxNumericInFilters, 
int systemConfigMaxNumericInFilters)
+  {
+    // if maxNumericInFIlters through context == 0 catch exception
+    // else if query context exceeds system set value throw error
+    if (queryContextMaxNumericInFilters == 0) {
+      throw new UOE("[%s] must be greater than 0", CTX_MAX_NUMERIC_IN_FILTERS);
+    } else if (queryContextMaxNumericInFilters > 
systemConfigMaxNumericInFilters
+               && systemConfigMaxNumericInFilters != NUM_FILTER_NOT_USED) {
+      throw new UOE(
+          "Expected parameter[%s] cannot exceed system set value of [%d]",
+          CTX_MAX_NUMERIC_IN_FILTERS,
+          systemConfigMaxNumericInFilters
+      );
+    }
+    // if system set value is not present, thereby inferring default of -1
+    if (systemConfigMaxNumericInFilters == NUM_FILTER_NOT_USED) {
+      return systemConfigMaxNumericInFilters;

Review comment:
       I think you want this to return the queryContextMaxNumericInFilters in 
this case. The reason I think this is a desired behavior is because it allows a 
sys admin to test what a reasonable default would be for the cluster by using 
the query context. Once the sys admin sets the maxNumericInFilters to some 
value, then having the system config be an upper limit makes sense. This should 
be documented in `configuration/index.md` and `querying/query-context.md`
   ```suggestion
         return queryContextMaxNumericInFilters;
   ```

##########
File path: sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java
##########
@@ -6903,6 +6904,67 @@ public void testMaxSubqueryRows() throws Exception
     );
   }
 
+  @Test
+  public void testZeroMaxNumericInFilter() throws Exception
+  {
+    expectedException.expect(UOE.class);
+    expectedException.expectMessage("[maxNumericInFilters] must be greater 
than 0");
+
+    testQuery(
+        PLANNER_CONFIG_DEFAULT,
+        ImmutableMap.of(QueryContexts.MAX_NUMERIC_IN_FILTERS, 0),

Review comment:
       Can we add a test for a non 0 query context with the planner config 
default to address my comment in PlannerConfig

##########
File path: sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java
##########
@@ -6903,6 +6904,67 @@ public void testMaxSubqueryRows() throws Exception
     );
   }
 
+  @Test
+  public void testZeroMaxNumericInFilter() throws Exception
+  {
+    expectedException.expect(UOE.class);
+    expectedException.expectMessage("[maxNumericInFilters] must be greater 
than 0");
+
+    testQuery(
+        PLANNER_CONFIG_DEFAULT,
+        ImmutableMap.of(QueryContexts.MAX_NUMERIC_IN_FILTERS, 0),
+        "SELECT COUNT(*)\n"
+        + "FROM druid.numfoo\n"
+        + "WHERE dim6 IN (\n"
+        + "1,2,3\n"
+        + ")\n",
+        CalciteTests.REGULAR_USER_AUTH_RESULT,
+        ImmutableList.of(),
+        ImmutableList.of()
+    );
+  }
+
+  @Test
+  public void testHighestMaxNumericInFilter() throws Exception
+  {
+    expectedException.expect(UOE.class);
+    expectedException.expectMessage("Expected parameter[maxNumericInFilters] 
cannot exceed system set value of [100]");
+
+    testQuery(
+        PLANNER_CONFIG_MAX_NUMERIC_IN_FILTER,
+        ImmutableMap.of(QueryContexts.MAX_NUMERIC_IN_FILTERS, 20000),
+        "SELECT COUNT(*)\n"
+        + "FROM druid.numfoo\n"
+        + "WHERE dim6 IN (\n"
+        + "1,2,3\n"
+        + ")\n",
+        CalciteTests.REGULAR_USER_AUTH_RESULT,
+        ImmutableList.of(),
+        ImmutableList.of()
+    );
+  }
+
+  @Test
+  public void testQueryWithMoreThanMaxNumericInFilter() throws Exception
+  {
+    expectedException.expect(UOE.class);
+    expectedException.expectMessage("The number of values in the IN clause for 
[dim6] in query exceeds configured maxNumericFilter limit of [2] for INs. Cast 
[3] values of IN clause to String");
+
+    testQuery(
+        PLANNER_CONFIG_MAX_NUMERIC_IN_FILTER,
+        ImmutableMap.of(QueryContexts.MAX_NUMERIC_IN_FILTERS, 2),
+        "SELECT COUNT(*)\n"
+        + "FROM druid.numfoo\n"
+        + "WHERE dim6 IN (\n"

Review comment:
       I think we'll want some additional tests here for more complex filters - 
Some examples:
   * ORs of multiple dimensions eg. something like dim1 = 1 OR dim2 =2 or dim3 
= 3
   * OR filter several levels deep eg. dim1='one' AND dim2 IN (1,2,3,4,5)
   * NOT IN filter eg. `dim6 NOT IN (...)`

##########
File path: 
sql/src/main/java/org/apache/druid/sql/calcite/run/NativeQueryMaker.java
##########
@@ -122,6 +127,32 @@ public boolean feature(QueryFeature feature)
         );
       }
     }
+    int numFilters = 
plannerContext.getPlannerConfig().getMaxNumericInFilters();
+
+    // special corner case handling for numeric IN filters
+    // in case of query containing IN (v1, v2, v3,...) where Vi is numeric
+    // a BoundFilter is created internally for each of the values
+    // whereas when Vi s are String the Filters are converted as BoundFilter 
to SelectorFilter to InFilter
+    // which takes lesser processing for bitmaps
+    // So in a case where user executes a query with multiple numeric INs, 
flame graph shows BoundFilter.getBitmapResult
+    // and BoundFilter.match predicate eating up processing time which stalls 
a historical for a query with large number
+    // of numeric INs (> 10K). In such cases user should change the query to 
specify the IN clauses as String
+    // Instead of IN(v1,v2,v3) user should specify IN('v1','v2','v3')
+    if (numFilters != PlannerConfig.NUM_FILTER_NOT_USED) {
+      if (query.getFilter() instanceof OrDimFilter) {

Review comment:
       This looks like it would only supports filtering on a top level OR 
filter. However if a filter on a query is nested, then this guardrail would not 
be effective. For example what if the filter was 
   
   ```
   dim1='one' AND dim2 IN (1,2,3,4,5)
   ```
   
   In this case, it looks like the query would still run.
   
   Do you think walking the entire filter tree would be expensive at this point 
in the query plan? The other thing I'd like to know more about is how 
`DimFilter#optimize` plays into this guardrail - is the query filter that is 
being used here before or after the optimize function is called? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] suneet-s commented on a change in pull request #12195: Moving in filter check to broker

Reply via email to