peter-toth commented on a change in pull request #31848:
URL: https://github.com/apache/spark/pull/31848#discussion_r597885016



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala
##########
@@ -84,11 +85,25 @@ trait FileScan extends Scan
 
   protected def seqToString(seq: Seq[Any]): String = seq.mkString("[", ", ", "]")
 
+  private lazy val (normalizedPartitionFilters, normalizedDataFilters) = {
+    val output = readSchema().toAttributes.map(a => a.withName(normalizeName(a.name)))

Review comment:
       I see your point and agree that the name is what matters in these
`Filter`-like `Expression`s, but if we go this way then I think:
   - we also need to clear the other properties of the `AttributeReference`s,
such as `qualifier`
   - we need to either explicitly sort the `partitionFilters` and `dataFilters`
expression lists (probably with `.sortBy(_.hashCode())`) to make sure they
match `f.partitionFilters` and `f.dataFilters`, or compare them with
`Set(partitionFilters) == Set(f.partitionFilters)`; we can't use
`ExpressionSet(partitionFilters) == ExpressionSet(f.partitionFilters)` because
we removed all the expr ids (see the sketch after this list)
   - we need to reorder all descendants of each `partitionFilters` and
`dataFilters` expression (with `Canonicalize.expressionReorder()`) to make sure
that e.g. `id = 1` matches `1 = id` (and `Canonicalize.ignoreTimeZone()` also
needs to be applied)
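
   A minimal Scala sketch of the first two points, assuming the filters are
available as `Seq[Expression]` (`normalize` and `filtersEqual` are just
illustrative helper names, not something from this PR):
   ```scala
   import org.apache.spark.sql.catalyst.expressions.{
     AttributeReference, ExprId, Expression}

   // Sketch only: keep just the name, type, nullability and metadata of the
   // attributes referenced by the filters, so that filters coming from two
   // different scans of the same table compare equal structurally.
   def normalize(filters: Seq[Expression]): Seq[Expression] =
     filters.map(_.transform {
       case a: AttributeReference =>
         // reset the expr id and drop the qualifier; name-case normalization
         // (`normalizeName`) is left out for brevity
         AttributeReference(a.name, a.dataType, a.nullable, a.metadata)(
           ExprId(0), Seq.empty)
     })

   // Order-insensitive comparison; `ExpressionSet` can't be used here because
   // the expr ids have been discarded. This still doesn't handle commuted
   // children like `id = 1` vs `1 = id`, which is what the reordering in the
   // third point is about.
   def filtersEqual(left: Seq[Expression], right: Seq[Expression]): Boolean =
     normalize(left).toSet == normalize(right).toSet
   ```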
   
   And just a side note: I think we could do most of the above at
https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala#L120-L121
before `withFilters()`, and then `FileScan.equals()` would become very simple.
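
   Purely as an illustration of that side note (assuming a `normalize` helper
like the one sketched above and the two-argument `withFilters()` used at the
linked lines; the helper name below is hypothetical):
   ```scala
   import org.apache.spark.sql.catalyst.expressions.Expression
   import org.apache.spark.sql.execution.datasources.v2.FileScan

   // Hypothetical shape for the optimizer rule: normalize the filters once,
   // right before pushing them into the scan, so that FileScan.equals() could
   // compare the stored filter sequences directly.
   def pushNormalizedFilters(
       scan: FileScan,
       partitionFilters: Seq[Expression],
       dataFilters: Seq[Expression]): FileScan =
     scan.withFilters(normalize(partitionFilters), normalize(dataFilters))
   ```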
   
   But I wonder whether all these changes would end up simpler than the current
PR?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
