cloud-fan commented on code in PR #55610:
URL: https://github.com/apache/spark/pull/55610#discussion_r3192923867


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala:
##########
@@ -485,6 +485,43 @@ object DateTimeUtils extends SparkDateTimeUtils {
     instantToMicros(truncated.toInstant)
   }
 
+  /**
+   * Fast path for truncating to MINUTE/HOUR/DAY using offset arithmetic 
instead of
+   * allocating a `ZonedDateTime` per row. The offset is resolved once for 
`micros`; the
+   * truncation then runs as `floorMod` in local time. We fall back to 
[[truncToUnit]] when
+   * the offset at the candidate truncated instant differs from the offset at 
`micros`,
+   * which means the truncation crosses a DST/historical transition and the 
local-time
+   * alignment we computed is no longer valid (see SPARK-30766/30857). The 
check is
+   * skipped for fixed-offset zones. Sub-minute offsets (e.g. 
America/Los_Angeles LMT
+   * -07:52:58, see SPARK-33404) and 30/45-minute offsets (Asia/Kolkata 
+05:30, Asia/Kathmandu
+   * +05:45) are handled correctly by this path because the offset is applied 
as part of
+   * the arithmetic; no offset-alignment guard is needed.

Review Comment:
   The algorithm depends on `unitMicros` dividing `MICROS_PER_DAY` — that's 
what makes wall-clock unit boundaries align to multiples of `unitMicros` in the 
shifted-local frame (since epoch is itself a unit boundary). It's true for all 
three call sites today, but if someone later extends `truncToUnitFast` to a 
non-dividing unit (or a unit that isn't fixed-width in micros) it would 
silently misalign. Worth one sentence in the Scaladoc, e.g.:
   
   > `unitMicros` must evenly divide `MICROS_PER_DAY`; otherwise wall-clock 
unit boundaries do not align to multiples of `unitMicros` in the shifted-local 
frame and the `floorMod` truncation is unsafe.



##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala:
##########
@@ -766,6 +766,97 @@ class DateTimeUtilsSuite extends SparkFunSuite with 
Matchers with SQLHelper {
     }
   }
 
+  test("truncTimestamp with sub-hour zone offsets") {
+    // Asia/Kolkata (+05:30) and Asia/Kathmandu (+05:45) are not aligned to 
HOUR in UTC.
+    // The fast path applies the offset as part of its arithmetic, so HOUR/DAY 
truncation
+    // produces the correct local-aligned result without needing the slow path.
+    val kolkata = getZoneId("Asia/Kolkata")
+    val ts = DateTimeUtils.stringToTimestamp(
+      UTF8String.fromString("2024-01-15T09:42:17.123456+05:30"), kolkata).get
+    val expectedHour = DateTimeUtils.stringToTimestamp(
+      UTF8String.fromString("2024-01-15T09:00:00+05:30"), kolkata).get
+    assert(DateTimeUtils.truncTimestamp(ts, DateTimeUtils.TRUNC_TO_HOUR, 
kolkata) === expectedHour)
+    val expectedDay = DateTimeUtils.stringToTimestamp(
+      UTF8String.fromString("2024-01-15T00:00:00+05:30"), kolkata).get
+    assert(DateTimeUtils.truncTimestamp(ts, DateTimeUtils.TRUNC_TO_DAY, 
kolkata) === expectedDay)
+
+    val kathmandu = getZoneId("Asia/Kathmandu")
+    val ts2 = DateTimeUtils.stringToTimestamp(
+      UTF8String.fromString("2024-01-15T09:42:17.123456+05:45"), kathmandu).get
+    val expectedHour2 = DateTimeUtils.stringToTimestamp(
+      UTF8String.fromString("2024-01-15T09:00:00+05:45"), kathmandu).get
+    assert(DateTimeUtils.truncTimestamp(
+      ts2, DateTimeUtils.TRUNC_TO_HOUR, kathmandu) === expectedHour2)
+  }
+
+  test("truncTimestamp across DST transitions") {
+    val la = getZoneId("America/Los_Angeles")
+    // Spring-forward in LA: 2024-03-10 02:00 PDT does not exist; 02:30 local 
maps to
+    // 2024-03-10 03:30 PDT in wall-clock terms. Use an instant just after the 
transition
+    // so HOUR/DAY truncation candidate falls into the pre-transition offset 
window.
+    val postSpring = DateTimeUtils.stringToTimestamp(
+      UTF8String.fromString("2024-03-10T03:30:00-07:00"), la).get
+    val expectedHour = DateTimeUtils.stringToTimestamp(
+      UTF8String.fromString("2024-03-10T03:00:00-07:00"), la).get
+    assert(DateTimeUtils.truncTimestamp(postSpring, 
DateTimeUtils.TRUNC_TO_HOUR, la)
+      === expectedHour)
+    val expectedDay = DateTimeUtils.stringToTimestamp(
+      UTF8String.fromString("2024-03-10T00:00:00-08:00"), la).get
+    assert(DateTimeUtils.truncTimestamp(postSpring, 
DateTimeUtils.TRUNC_TO_DAY, la)
+      === expectedDay)
+
+    // Fall-back in LA: 2024-11-03 01:30 occurs twice. Truncation to HOUR/DAY 
should
+    // produce the same wall-clock boundary as the slow path regardless.
+    val postFall = DateTimeUtils.stringToTimestamp(
+      UTF8String.fromString("2024-11-03T01:30:00-08:00"), la).get
+    val expectedHour2 = DateTimeUtils.stringToTimestamp(
+      UTF8String.fromString("2024-11-03T01:00:00-08:00"), la).get
+    assert(DateTimeUtils.truncTimestamp(postFall, DateTimeUtils.TRUNC_TO_HOUR, 
la)
+      === expectedHour2)
+    val expectedDay2 = DateTimeUtils.stringToTimestamp(
+      UTF8String.fromString("2024-11-03T00:00:00-07:00"), la).get
+    assert(DateTimeUtils.truncTimestamp(postFall, DateTimeUtils.TRUNC_TO_DAY, 
la)
+      === expectedDay2)
+  }
+
+  test("SPARK-30766/30857: truncTimestamp before the epoch in HOUR/DAY") {

Review Comment:
   The existing `SPARK-33404` test at line 692 loops over `ALL_TIMEZONES` with 
the 1769-10-17 LMT input but only exercises 
`TRUNC_TO_MINUTE`/`SECOND`/`MILLI`/`MICRO`. With this PR, `TRUNC_TO_HOUR` and 
`TRUNC_TO_DAY` also flow through the same offset-arithmetic path, and the LMT 
(sub-minute offset) case is exactly the bug class that motivated the slow path. 
The new test here only covers LA at 1960 (no LMT). Adding two lines to the 
SPARK-33404 loop would extend coverage to every available zone at near-zero 
cost — the most direct regression guard against re-introducing the SPARK-33404 
bug:
   
   ```scala
   testTrunc(DateTimeUtils.TRUNC_TO_HOUR, "1769-10-17T17:00:00", inputTS.get, 
zid)
   testTrunc(DateTimeUtils.TRUNC_TO_DAY, "1769-10-17T00:00:00", inputTS.get, 
zid)
   ```



##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala:
##########
@@ -766,6 +766,97 @@ class DateTimeUtilsSuite extends SparkFunSuite with 
Matchers with SQLHelper {
     }
   }
 
+  test("truncTimestamp with sub-hour zone offsets") {
+    // Asia/Kolkata (+05:30) and Asia/Kathmandu (+05:45) are not aligned to 
HOUR in UTC.
+    // The fast path applies the offset as part of its arithmetic, so HOUR/DAY 
truncation
+    // produces the correct local-aligned result without needing the slow path.
+    val kolkata = getZoneId("Asia/Kolkata")
+    val ts = DateTimeUtils.stringToTimestamp(
+      UTF8String.fromString("2024-01-15T09:42:17.123456+05:30"), kolkata).get
+    val expectedHour = DateTimeUtils.stringToTimestamp(
+      UTF8String.fromString("2024-01-15T09:00:00+05:30"), kolkata).get
+    assert(DateTimeUtils.truncTimestamp(ts, DateTimeUtils.TRUNC_TO_HOUR, 
kolkata) === expectedHour)
+    val expectedDay = DateTimeUtils.stringToTimestamp(
+      UTF8String.fromString("2024-01-15T00:00:00+05:30"), kolkata).get
+    assert(DateTimeUtils.truncTimestamp(ts, DateTimeUtils.TRUNC_TO_DAY, 
kolkata) === expectedDay)
+
+    val kathmandu = getZoneId("Asia/Kathmandu")
+    val ts2 = DateTimeUtils.stringToTimestamp(
+      UTF8String.fromString("2024-01-15T09:42:17.123456+05:45"), kathmandu).get
+    val expectedHour2 = DateTimeUtils.stringToTimestamp(
+      UTF8String.fromString("2024-01-15T09:00:00+05:45"), kathmandu).get
+    assert(DateTimeUtils.truncTimestamp(
+      ts2, DateTimeUtils.TRUNC_TO_HOUR, kathmandu) === expectedHour2)
+  }
+
+  test("truncTimestamp across DST transitions") {
+    val la = getZoneId("America/Los_Angeles")
+    // Spring-forward in LA: 2024-03-10 02:00 PDT does not exist; 02:30 local 
maps to
+    // 2024-03-10 03:30 PDT in wall-clock terms. Use an instant just after the 
transition
+    // so HOUR/DAY truncation candidate falls into the pre-transition offset 
window.

Review Comment:
   Minor: "2024-03-10 02:00 PDT does not exist" mislabels the gap. It's the 
wall-clock reading 02:00–02:59 that doesn't exist on 2024-03-10 in LA, 
regardless of label; `2024-03-10T02:00-07:00` is itself a well-formed UTC 
instant (08:00 UTC) — one hour before the transition, actually in PST.
   
   ```suggestion
       // Spring-forward in LA: local 02:00-02:59 doesn't exist on 2024-03-10
       // (01:59 PST jumps to 03:00 PDT). Pick 03:30 PDT just after the 
transition
       // so the HOUR/DAY truncation candidate falls into the pre-transition 
window.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to