Re: [PR] [HUDI-7310] Optimize Column Stats Partition Pruning for Non-Partition Pruning Queries [hudi]

via GitHub Thu, 18 Jan 2024 20:00:28 -0800


stream2000 commented on code in PR #10528:
URL: https://github.com/apache/hudi/pull/10528#discussion_r1458286332



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -361,9 +364,16 @@ case class HoodieFileIndex(spark: SparkSession,
       //       For that we use a simple-heuristic to determine whether we should read and process CSI in-memory or
       //       on-cluster: total number of rows of the expected projected portion of the index has to be below the
       //       threshold (of 100k records)
-      val prunedFileNames = getPrunedFileNames(prunedPartitionsAndFileSlices)
       val shouldReadInMemory = columnStatsIndex.shouldReadInMemory(this, queryReferencedColumns)
-      columnStatsIndex.loadTransposed(queryReferencedColumns, shouldReadInMemory, prunedFileNames) { transposedColStatsDF =>
+      val prunedFileNames = getPrunedFileNames(prunedPartitionsAndFileSlices)
+      // NOTE: This judgment has two purposes:

Review Comment:
   nit: We can simplify the comment to: 
   
   // If partition pruning doesn't prune any files, then there's no need to apply file filters when loading the Column Statistics Index
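
   To illustrate what the simplified comment describes, here is a minimal sketch. It assumes, as the second hunk of this diff suggests, that "partition pruning doesn't prune any files" is detected by the partition filters being empty. `FileFilterSketch` and `fileNamesToPushDown` are hypothetical names for illustration only, not part of Hudi's API.

```scala
// Hypothetical sketch, not Hudi's actual implementation.
object FileFilterSketch {
  // Decide whether a file-name filter is worth applying when loading the
  // Column Statistics Index.
  //  - Some(names): partition filters exist, so restrict the load to the pruned files.
  //  - None: no partition filters, so pruning kept every file and a filter is redundant.
  def fileNamesToPushDown(partitionFilters: Seq[String],
                          prunedFileNames: Set[String]): Option[Set[String]] =
    if (partitionFilters.nonEmpty) Some(prunedFileNames) else None
}
```

   With no partition filters the sketch returns `None`, so the index would be loaded without a file filter; with at least one partition filter it returns the pruned file names unchanged.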



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -233,8 +233,9 @@ case class HoodieFileIndex(spark: SparkSession,
       //    - Col-Stats Index is present
       //    - Record-level Index is present
       //    - List of predicates (filters) is present
+      val shouldPushDownFilesFilter = !partitionFilters.isEmpty
       val candidateFilesNamesOpt: Option[Set[String]] =
-      lookupCandidateFilesInMetadataTable(dataFilters, prunedPartitionsAndFileSlices) match {
+      lookupCandidateFilesInMetadataTable(dataFilters, shouldPushDownFilesFilter, prunedPartitionsAndFileSlices) match {

Review Comment:
   We can move the `shouldPushDownFilesFilter` to the end of the parameter list.



