cshuo commented on code in PR #19006:
URL: https://github.com/apache/hudi/pull/19006#discussion_r3432526625


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##########
@@ -511,6 +511,15 @@ public class FlinkOptions extends HoodieConfig {
           + "E.g., given query: SELECT * FROM T WHERE `uuid` IN (1,2,3,4,5,6,7,8,9), the number of hoodie keys is 9, and\n"
           + "the maximum value is 8, so the source will not perform record level index filtering.");
 
+  @AdvancedConfig
+  public static final ConfigOption<Integer> READ_DATA_SKIPPING_RLI_PARTITIONS_MAX_NUM = ConfigOptions
+      .key("read.data.skipping.rli.partitions.max.num")
+      .intType()
+      .defaultValue(3)
+      .withDescription("The maximum number of candidate data table partitions that can be queried through the partitioned record level index "

Review Comment:
   This threshold bounds the cost of the partitioned-RLI lookup. Unlike the
global RLI (a single lookup over all keys), the partitioned variant needs a
separate metadata-table read for each data partition. When a query doesn't
filter on the partition column, the candidate set can span a large number of
data partitions, and we'd fan out an RLI lookup to each one, so the lookup cost
grows with the partition count. Once the candidate partition count exceeds the
threshold, we skip pruning and return all file slices (results stay correct,
with no extra lookups).
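
   A minimal sketch of the guard described above, for illustration only. The
class `PartitionedRliGuard`, the method `selectFileSlices`, and the `rliLookup`
callback are hypothetical names, not Hudi APIs; file slices and partitions are
modeled as plain strings, and the actual PR code may be structured differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiFunction;

// Hypothetical illustration of the threshold guard; not the actual Hudi implementation.
class PartitionedRliGuard {

  /**
   * Returns the file slices to read.
   *
   * @param fileSlicesByPartition candidate data partitions mapped to their file slices
   * @param recordKeys            record keys taken from the query predicate
   * @param rliLookup             assumed per-partition RLI lookup:
   *                              (partition, keys) -> file slices that may contain the keys
   * @param maxPartitions         the threshold (the diff shows a default of 3)
   */
  static List<String> selectFileSlices(
      Map<String, List<String>> fileSlicesByPartition,
      Set<String> recordKeys,
      BiFunction<String, Set<String>, Set<String>> rliLookup,
      int maxPartitions) {

    // Too many candidate partitions: skip pruning and return every file slice.
    // The result is still correct, and no per-partition lookup is issued.
    if (fileSlicesByPartition.size() > maxPartitions) {
      List<String> all = new ArrayList<>();
      fileSlicesByPartition.values().forEach(all::addAll);
      return all;
    }

    // Otherwise fan out one lookup per candidate partition and keep only the hits.
    List<String> pruned = new ArrayList<>();
    for (Map.Entry<String, List<String>> entry : fileSlicesByPartition.entrySet()) {
      Set<String> hits = rliLookup.apply(entry.getKey(), recordKeys);
      for (String slice : entry.getValue()) {
        if (hits.contains(slice)) {
          pruned.add(slice);
        }
      }
    }
    return pruned;
  }
}
```

   In this sketch the number of lookups is bounded by `maxPartitions`, which is
the property the threshold is meant to guarantee.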
   
   Yeah, we can use a hardcoded threshold for the first release, since it's not
a config that users are expected to tune; it's more of a safeguard.
   
   


