yihua commented on code in PR #18126:
URL: https://github.com/apache/hudi/pull/18126#discussion_r3250346252


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -167,19 +167,49 @@ case class HoodieFileIndex(spark: SparkSession,
   /**
    * Invoked by Spark to fetch list of latest base files per partition.
    *
-   * @param partitionFilters partition column filters
+   * NOTE: For tables with nested partition columns (e.g. `nested_record.level`), Spark's
+   * [[org.apache.spark.sql.execution.datasources.FileSourceScanExec]] uses standard attribute-name
+   * matching when splitting filters into partition vs. data filters. Because the filter expression
+   * for `nested_record.level = 'INFO'` is represented as
+   * `GetStructField(AttributeReference("nested_record"), …)` — whose reference is the *struct*
+   * attribute `nested_record`, not the flat partition attribute `nested_record.level` — Spark
+   * classifies it as a data filter.  This means `partitionFilters` arrives here empty and
+   * `dataFilters` contains the nested-field predicate.  We re-split the combined set of filters
+   * below so that predicates whose only references are struct-parents of partition columns are
+   * treated as partition filters, matching the behaviour of [[HoodiePruneFileSourcePartitions]].
+   *
+   * @param partitionFilters partition column filters (may be incomplete for nested columns)
    * @param dataFilters      data columns filters
    * @return list of PartitionDirectory containing partition to base files mapping
    */
   override def listFiles(partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Seq[PartitionDirectory] = {
-    val slices = filterFileSlices(dataFilters, partitionFilters).flatMap(
+    val (actualPartitionFilters, actualDataFilters) =
+      reclassifyFiltersForNestedPartitionColumns(partitionFilters, dataFilters)

Review Comment:
   Got it.  So we have to go with the current mitigation approach.
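
For readers following the thread, the re-split described in the Scaladoc above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `Expr`, `Attr`, `GetField`, `Lit`, `EqualTo` and `resplit` are hypothetical toy stand-ins for Catalyst's expression classes and for `reclassifyFiltersForNestedPartitionColumns`, whose real body is not shown in this hunk. The sketch also matches on the full dotted path, which may be stricter than what the PR does.

```scala
object NestedPartitionFilterSketch {
  // Toy stand-ins for Catalyst expressions (hypothetical, not Spark's classes).
  sealed trait Expr { def children: Seq[Expr] }
  case class Attr(name: String) extends Expr { val children: Seq[Expr] = Nil }
  case class GetField(child: Expr, field: String) extends Expr {
    val children: Seq[Expr] = Seq(child)
  }
  case class Lit(value: Any) extends Expr { val children: Seq[Expr] = Nil }
  case class EqualTo(left: Expr, right: Expr) extends Expr {
    val children: Seq[Expr] = Seq(left, right)
  }

  // Dotted path of an attribute or a chain of struct-field accesses,
  // e.g. GetField(Attr("nested_record"), "level") => "nested_record.level".
  def pathOf(e: Expr): Option[String] = e match {
    case Attr(n)        => Some(n)
    case GetField(c, f) => pathOf(c).map(_ + "." + f)
    case _              => None
  }

  // All column paths referenced anywhere inside an expression.
  def referencedPaths(e: Expr): Set[String] = pathOf(e) match {
    case Some(p) => Set(p)
    case None    => e.children.flatMap(referencedPaths).toSet
  }

  // Hypothetical re-split: pool both filter lists, then treat a predicate as a
  // partition filter only if every column path it references is a partition column.
  def resplit(partitionFilters: Seq[Expr],
              dataFilters: Seq[Expr],
              partitionColumns: Seq[String]): (Seq[Expr], Seq[Expr]) = {
    val partitionPaths = partitionColumns.toSet
    (partitionFilters ++ dataFilters).partition { f =>
      val refs = referencedPaths(f)
      refs.nonEmpty && refs.subsetOf(partitionPaths)
    }
  }
}
```

With `nested_record.level` as the only partition column, a predicate such as `EqualTo(GetField(Attr("nested_record"), "level"), Lit("INFO"))` that arrives in `dataFilters` is moved to the partition side, while a predicate on an ordinary data column stays where it was.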



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

