codope commented on code in PR #10352:
URL: https://github.com/apache/hudi/pull/10352#discussion_r1519215971
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -346,6 +352,12 @@ case class HoodieFileIndex(spark: SparkSession,
Option.empty
} else if (recordKeys.nonEmpty) {
         Option.apply(recordLevelIndex.getCandidateFiles(getAllFiles(), recordKeys))
+      } else if (recordKeys.nonEmpty && partitionStatsIndex.isIndexAvailable && !queryFilters.isEmpty) {
+        val prunedFileNames = getPrunedFileNames(prunedPartitionsAndFileSlices)
+        val shouldReadInMemory = partitionStatsIndex.shouldReadInMemory(this, queryReferencedColumns)
Review Comment:
   First, to clarify: partition stats are collected only when column stats is enabled, and only for the columns for which column stats is enabled.
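   To make that dependency concrete, here is a hedged sketch of the writer options involved. The column stats key is the one I recall from `HoodieMetadataConfig`; the partition stats key is my assumption of the name introduced by this PR, so verify both against the actual config class.

   ```scala
   // Hedged sketch: config keys are assumed, check HoodieMetadataConfig.
   // Partition stats depend on column stats being enabled, and cover the
   // same set of columns that column stats cover.
   val writerOpts: Map[String, String] = Map(
     "hoodie.metadata.enable" -> "true",
     // prerequisite: column stats index
     "hoodie.metadata.index.column.stats.enable" -> "true",
     // assumed key: partition stats piggyback on the column-stats columns
     "hoodie.metadata.index.partition.stats.enable" -> "true"
   )
   ```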
   Write path: when the user calls `.partitionBy("a,b,c")`, the logic is similar to column stats: we take the commit metadata and convert it to partition stats records. This happens in `HoodieTableMetadataUtil.convertMetadataToPartitionStatsRecords`. The difference from column stats is that the stats are aggregated by partition value in `BaseFileUtils.getColumnRangeInPartition`.
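   As an illustration of that aggregation step, a minimal self-contained sketch; `ColumnRange`, `mergeRanges`, and `aggregateByPartition` are hypothetical stand-ins for the real metadata classes, not Hudi's actual API:

   ```scala
   // Simplified stand-in for a per-file column range from commit metadata.
   case class ColumnRange(minValue: Long, maxValue: Long, nullCount: Long)

   // Mirrors the idea behind BaseFileUtils.getColumnRangeInPartition: merge
   // the per-file ranges of one column into a single range for the partition.
   def mergeRanges(ranges: Seq[ColumnRange]): ColumnRange =
     ranges.reduce { (a, b) =>
       ColumnRange(
         math.min(a.minValue, b.minValue),
         math.max(a.maxValue, b.maxValue),
         a.nullCount + b.nullCount)
     }

   // Aggregate by partition value: (partitionPath, fileRange) pairs in,
   // one merged range per partition out.
   def aggregateByPartition(perFile: Seq[(String, ColumnRange)]): Map[String, ColumnRange] =
     perFile.groupBy(_._1).map { case (partition, rows) =>
       partition -> mergeRanges(rows.map(_._2))
     }
   ```

   The point is only that column stats stay per-file while partition stats collapse those ranges to one entry per partition value.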
   Read path: `queryReferencedColumns` here contains the columns referenced by the data filters. Partition pruning based on partition filters has already happened one level above.
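   For intuition on how a data filter is applied against the per-partition ranges, a hedged sketch with a single equality predicate; the real code evaluates general Spark `Expression` filters, and `PartitionRange`/`prunePartitions` are hypothetical names:

   ```scala
   // Simplified per-partition min/max for one referenced column.
   case class PartitionRange(minValue: Long, maxValue: Long)

   // A partition survives only if the filtered value can fall inside its
   // [min, max] range; everything else is skipped before file listing.
   def prunePartitions(stats: Map[String, PartitionRange], value: Long): Set[String] =
     stats.collect {
       case (partition, r) if value >= r.minValue && value <= r.maxValue => partition
     }.toSet
   ```

   This is the same skipping idea as column stats, just one level coarser: whole partitions are ruled out instead of individual files.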
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]