codope commented on code in PR #10352:
URL: https://github.com/apache/hudi/pull/10352#discussion_r1519215971
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -346,6 +352,12 @@ case class HoodieFileIndex(spark: SparkSession,
Option.empty
} else if (recordKeys.nonEmpty) {
         Option.apply(recordLevelIndex.getCandidateFiles(getAllFiles(), recordKeys))
+      } else if (recordKeys.nonEmpty && partitionStatsIndex.isIndexAvailable && !queryFilters.isEmpty) {
+        val prunedFileNames = getPrunedFileNames(prunedPartitionsAndFileSlices)
+        val shouldReadInMemory = partitionStatsIndex.shouldReadInMemory(this, queryReferencedColumns)
Review Comment:
   First, to clarify: partition stats are collected only when column stats is enabled, and only for the columns for which column stats is enabled.
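   To make that dependency concrete, here is a hedged sketch of the writer options involved. The column stats key is the one I recall from `HoodieMetadataConfig`; the partition stats key is my assumption of the name introduced by this PR, so verify both against the actual config class.

   ```scala
   // Hedged sketch: config keys are assumed, check HoodieMetadataConfig.
   // Partition stats depend on column stats being enabled, and cover the
   // same set of columns that column stats cover.
   val writerOpts: Map[String, String] = Map(
     "hoodie.metadata.enable" -> "true",
     // prerequisite: column stats index
     "hoodie.metadata.index.column.stats.enable" -> "true",
     // assumed key: partition stats piggyback on the column-stats columns
     "hoodie.metadata.index.partition.stats.enable" -> "true"
   )
   ```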
   Write path: when the user calls `.partitionBy("a,b,c")`, the logic is similar to column stats: we take the commit metadata and convert it to partition stats records. This happens in `HoodieTableMetadataUtil.convertMetadataToPartitionStatsRecords`. The difference from column stats is that the stats are aggregated by partition value in `BaseFileUtils.getColumnRangeInPartition`.
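   As an illustration of that aggregation step, a minimal self-contained sketch; `ColumnRange`, `mergeRanges`, and `aggregateByPartition` are hypothetical stand-ins for the real metadata classes, not Hudi's actual API:

   ```scala
   // Simplified stand-in for a per-file column range from commit metadata.
   case class ColumnRange(minValue: Long, maxValue: Long, nullCount: Long)

   // Mirrors the idea behind BaseFileUtils.getColumnRangeInPartition: merge
   // the per-file ranges of one column into a single range for the partition.
   def mergeRanges(ranges: Seq[ColumnRange]): ColumnRange =
     ranges.reduce { (a, b) =>
       ColumnRange(
         math.min(a.minValue, b.minValue),
         math.max(a.maxValue, b.maxValue),
         a.nullCount + b.nullCount)
     }

   // Aggregate by partition value: (partitionPath, fileRange) pairs in,
   // one merged range per partition out.
   def aggregateByPartition(perFile: Seq[(String, ColumnRange)]): Map[String, ColumnRange] =
     perFile.groupBy(_._1).map { case (partition, rows) =>
       partition -> mergeRanges(rows.map(_._2))
     }
   ```

   The point is only that column stats stay per-file while partition stats collapse those ranges to one entry per partition value.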
   Read path: `queryReferencedColumns` here contains the columns referenced by the data filters. Partition pruning based on partition filters has already happened one level above.
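   For intuition on how a data filter is applied against the per-partition ranges, a hedged sketch with a single equality predicate; the real code evaluates general Spark `Expression` filters, and `PartitionRange`/`prunePartitions` are hypothetical names:

   ```scala
   // Simplified per-partition min/max for one referenced column.
   case class PartitionRange(minValue: Long, maxValue: Long)

   // A partition survives only if the filtered value can fall inside its
   // [min, max] range; everything else is skipped before file listing.
   def prunePartitions(stats: Map[String, PartitionRange], value: Long): Set[String] =
     stats.collect {
       case (partition, r) if value >= r.minValue && value <= r.maxValue => partition
     }.toSet
   ```

   This is the same skipping idea as column stats, just one level coarser: whole partitions are ruled out instead of individual files.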
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]