yihua commented on code in PR #18126:
URL: https://github.com/apache/hudi/pull/18126#discussion_r3250346252
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -167,19 +167,49 @@ case class HoodieFileIndex(spark: SparkSession,
/**
* Invoked by Spark to fetch list of latest base files per partition.
*
- * @param partitionFilters partition column filters
+ * NOTE: For tables with nested partition columns (e.g. `nested_record.level`), Spark's
+ * [[org.apache.spark.sql.execution.datasources.FileSourceScanExec]] uses standard attribute-name
+ * matching when splitting filters into partition vs. data filters. Because the filter expression
+ * for `nested_record.level = 'INFO'` is represented as
+ * `GetStructField(AttributeReference("nested_record"), …)` — whose
reference is the *struct*
+ * attribute `nested_record`, not the flat partition attribute
`nested_record.level` — Spark
+ * classifies it as a data filter. This means `partitionFilters` arrives here empty and
+ * `dataFilters` contains the nested-field predicate. We re-split the combined set of filters
+ * below so that predicates whose only references are struct-parents of partition columns are
+ * treated as partition filters, matching the behaviour of [[HoodiePruneFileSourcePartitions]].
+ *
+ * @param partitionFilters partition column filters (may be incomplete for nested columns)
* @param dataFilters data columns filters
* @return list of PartitionDirectory containing partition to base files mapping
*/
override def listFiles(partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Seq[PartitionDirectory] = {
- val slices = filterFileSlices(dataFilters, partitionFilters).flatMap(
+ val (actualPartitionFilters, actualDataFilters) =
+ reclassifyFiltersForNestedPartitionColumns(partitionFilters, dataFilters)
Review Comment:
Got it. So we have to go with the current mitigation approach.
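
For readers of this thread, the re-split described in the doc comment can be sketched roughly as below. This is a Spark-free toy model under stated assumptions, not the PR's implementation: `Expr`, `Attr`, `GetField`, `Lit`, `EqualTo`, and `FilterSplit.reclassify` are hypothetical names that only mimic how a nested-field predicate reports its struct parent as its sole reference.

```scala
// Toy model only: these types are hypothetical stand-ins, NOT Spark Catalyst
// classes and NOT the helper added in the PR.
sealed trait Expr { def references: Set[String] }

final case class Attr(name: String) extends Expr {
  def references: Set[String] = Set(name)
}

final case class GetField(child: Expr, field: String) extends Expr {
  // As the doc comment describes for GetStructField: the reference is the
  // struct attribute, not the dotted partition column name.
  def references: Set[String] = child.references
}

final case class Lit(value: Any) extends Expr {
  def references: Set[String] = Set.empty
}

final case class EqualTo(left: Expr, right: Expr) extends Expr {
  def references: Set[String] = left.references ++ right.references
}

object FilterSplit {
  /** Struct parents of dotted partition columns, e.g. "nested_record.level" -> "nested_record". */
  def structParents(partitionColumns: Seq[String]): Set[String] =
    partitionColumns.filter(_.contains(".")).map(_.takeWhile(_ != '.')).toSet

  /**
   * Moves data filters whose references are all partition columns or struct
   * parents of partition columns into the partition-filter set.
   * Assumption: the accessed field itself is not inspected, so this is coarser
   * than a real implementation may need to be.
   */
  def reclassify(partitionColumns: Seq[String],
                 partitionFilters: Seq[Expr],
                 dataFilters: Seq[Expr]): (Seq[Expr], Seq[Expr]) = {
    val partitionRefs = partitionColumns.toSet ++ structParents(partitionColumns)
    val (promoted, remaining) = dataFilters.partition { f =>
      f.references.nonEmpty && f.references.subsetOf(partitionRefs)
    }
    (partitionFilters ++ promoted, remaining)
  }
}
```

In this model, a predicate such as `EqualTo(GetField(Attr("nested_record"), "level"), Lit("INFO"))` arrives as a data filter and is moved to the partition side when `nested_record.level` is a partition column, while a predicate on an unrelated column stays a data filter.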