[GitHub] [spark] rdblue commented on a change in pull request #28993: [SPARK-32168][SQL] Fix hidden partitioning correctness bug in SQL overwrite

GitBox Wed, 08 Jul 2020 10:14:14 -0700


rdblue commented on a change in pull request #28993:
URL: https://github.com/apache/spark/pull/28993#discussion_r451701120




##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/connector/InMemoryTable.scala
##########
@@ -78,10 +92,44 @@ class InMemoryTable(
             throw new IllegalArgumentException(s"Unsupported type, ${dataType.simpleString}")
         }
       } else {
-        value
+        (value, schema(index).dataType)
       }
     }
-    partCols.map(fieldNames => extractor(fieldNames, schema, row))
+
+    partitioning.map {
+      case IdentityTransform(ref) =>
+        extractor(ref.fieldNames, schema, row)._1
+      case YearsTransform(ref) =>
+        extractor(ref.fieldNames, schema, row) match {
+          case (days: Int, DateType) =>
+            ChronoUnit.YEARS.between(EPOCH_LOCAL_DATE, DateTimeUtils.daysToLocalDate(days))
+          case (micros: Long, TimestampType) =>
+            val localDate = DateTimeUtils.microsToInstant(micros).atZone(UTC).toLocalDate
+            ChronoUnit.YEARS.between(EPOCH_LOCAL_DATE, localDate)

Review comment:
       Yes, @cloud-fan is right.
   
   Spark also doesn't require specific behavior for these. The `days` function 
indicates that data should be broken down into day-sized partitions, but 
doesn't require a specific boundary. It is up to the source to decide where 
those boundaries are.
   
   By the time timestamps get to the source, they are already normalized values
in microseconds from epoch in UTC. Because we have consistent values, the
source can divide values into partitions however it wants. A filter like
`ts >= t1 and ts < t2` gets converted to
`part_day >= day(t1) and part_day <= day(t2)` no matter what the specific
implementation of `day` is.
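
   As a rough illustration of the point above, here is a minimal sketch, not
Spark or connector code. Every name in it (`DayPartitionSketch`, `dayUtc`,
`dayShifted`, `pruningIsSafe`) is hypothetical. It assumes only that a source's
`day` function is non-decreasing in the timestamp, which holds for any
contiguous day-sized bucketing, and checks that the converted partition filter
never prunes a row the original filter would keep.

```scala
object DayPartitionSketch {
  // Microseconds in one day; timestamps are micros from epoch in UTC.
  val MicrosPerDay: Long = 24L * 60 * 60 * 1000 * 1000
  val MicrosPerHour: Long = 60L * 60 * 1000 * 1000

  // Hypothetical source A: day boundaries at midnight UTC.
  def dayUtc(micros: Long): Long =
    java.lang.Math.floorDiv(micros, MicrosPerDay)

  // Hypothetical source B: day boundaries shifted by 8 hours (an assumption
  // chosen only to show that a different boundary still works).
  def dayShifted(micros: Long): Long =
    java.lang.Math.floorDiv(micros + 8L * MicrosPerHour, MicrosPerDay)

  // Row-level filter from the comment: ts >= t1 and ts < t2.
  def rowMatches(ts: Long, t1: Long, t2: Long): Boolean =
    ts >= t1 && ts < t2

  // Converted filter: part_day >= day(t1) and part_day <= day(t2).
  def partitionMayMatch(day: Long => Long, ts: Long, t1: Long, t2: Long): Boolean = {
    val partDay = day(ts)
    partDay >= day(t1) && partDay <= day(t2)
  }

  // True when every row kept by the row filter is also kept by the
  // partition filter, i.e. pruning by partition drops no matching row.
  def pruningIsSafe(day: Long => Long, rows: Seq[Long], t1: Long, t2: Long): Boolean =
    rows
      .filter(ts => rowMatches(ts, t1, t2))
      .forall(ts => partitionMayMatch(day, ts, t1, t2))
}
```

   The upper bound on `part_day` is inclusive because the partition that
contains `t2` can still hold rows with `ts < t2`. The two functions assign
different partition values to the same timestamp, yet the same converted filter
is safe for both.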




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]