cloud-fan commented on a change in pull request #31848:
URL: https://github.com/apache/spark/pull/31848#discussion_r597685965
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala
##########
@@ -84,11 +85,25 @@ trait FileScan extends Scan
 protected def seqToString(seq: Seq[Any]): String = seq.mkString("[", ", ", "]")
+  private lazy val (normalizedPartitionFilters, normalizedDataFilters) = {
+    val output = readSchema().toAttributes.map(a => a.withName(normalizeName(a.name)))
Review comment:
Thinking about it again: `FileScan` equality already considers `fileIndex` and `readSchema`, which means two file scans are equal only if they read the same set of files and the same set of columns.
Given that, I think the expr IDs do not matter for the filters; only the column names matter. Normal v2 sources use `Filter` rather than `Expression`, and `Filter` does not have expr IDs either.
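For illustration, here is a minimal sketch of a pushed-down v2 `Filter` (the column name `"i"` and the value are made up): it references the column purely by name, so there is no expr ID to normalize.

```scala
import org.apache.spark.sql.sources.{EqualTo, Filter}

// A DSv2 source filter: the column is identified only by its name.
val f: Filter = EqualTo("i", 1)
```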
The data/partition filters are created in `PruneFileSourcePartitions` (see
https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala#L51),
and the column names inside the filters are already normalized w.r.t. the actual
file scan output schema, so we don't need to consider case sensitivity here.
That said, I think the normalization logic here can be very simple: just set all
expr IDs to 0, as in the sketch below.
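A minimal sketch of that idea, assuming a hypothetical `normalizeExprIds` helper inside `FileScan` (`Expression`, `AttributeReference`, and `ExprId` are the existing catalyst classes):

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, ExprId, Expression}

// Hypothetical helper: zero out the expr IDs so that two filter lists
// referencing the same columns by name compare equal, regardless of
// which plan instance produced them.
private def normalizeExprIds(filters: Seq[Expression]): Seq[Expression] = {
  filters.map(_.transform {
    case a: AttributeReference => a.withExprId(ExprId(0))
  })
}
```

`equals`/`hashCode` could then compare `normalizeExprIds(partitionFilters)` and `normalizeExprIds(dataFilters)` instead of the raw expressions.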
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]