cloud-fan commented on a change in pull request #31848:
URL: https://github.com/apache/spark/pull/31848#discussion_r597685965
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala
##########
@@ -84,11 +85,25 @@ trait FileScan extends Scan
 protected def seqToString(seq: Seq[Any]): String = seq.mkString("[", ", ", "]")
+  private lazy val (normalizedPartitionFilters, normalizedDataFilters) = {
+    val output = readSchema().toAttributes.map(a => a.withName(normalizeName(a.name)))
Review comment:
Thinking about it again: `FileScan` equality already considers `fileIndex` and `readSchema`, which means two file scans are equal only if they read the same set of files and the same set of columns.
Given that, I think the expr IDs do not matter for the filters; only the column names matter. Normal v2 sources use `Filter` rather than `Expression`, and `Filter` does not have expr IDs either.
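For illustration, here is a minimal sketch of a pushed-down v2 `Filter` (the column name `"i"` and the value are made up): it references the column purely by name, so there is no expr ID to normalize.

```scala
import org.apache.spark.sql.sources.{EqualTo, Filter}

// A DSv2 source filter: the column is identified only by its name.
val f: Filter = EqualTo("i", 1)
```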
The data/partition filters are created in `PruneFileSourcePartitions` (see
https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala#L51),
and the column names inside the filters are already normalized w.r.t. the actual
file scan output schema, so we don't need to consider case sensitivity here.
That said, I think the normalization logic here can be very simple: just set all
expr IDs to 0, as in the sketch below.
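A minimal sketch of that idea, assuming a hypothetical `normalizeExprIds` helper inside `FileScan` (`Expression`, `AttributeReference`, and `ExprId` are the existing catalyst classes):

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, ExprId, Expression}

// Hypothetical helper: zero out the expr IDs so that two filter lists
// referencing the same columns by name compare equal, regardless of
// which plan instance produced them.
private def normalizeExprIds(filters: Seq[Expression]): Seq[Expression] = {
  filters.map(_.transform {
    case a: AttributeReference => a.withExprId(ExprId(0))
  })
}
```

`equals`/`hashCode` could then compare `normalizeExprIds(partitionFilters)` and `normalizeExprIds(dataFilters)` instead of the raw expressions.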
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]