andygrove opened a new issue, #3477:
URL: https://github.com/apache/datafusion-comet/issues/3477

   ## Description
   
   Timestamp truncation (`date_trunc` / `TruncTimestamp`) for sub-day formats 
(HOUR, MINUTE, SECOND, MILLISECOND, MICROSECOND) can be optimized using simple 
modular arithmetic instead of the current approach that converts to/from 
`DateTime<Tz>` via chrono.
   
   Since timestamps are stored internally as microseconds since Unix epoch 
(UTC), truncation to e.g. SECOND can be computed as:
   
   ```rust
   micros - (micros % MICROS_PER_SECOND)
   ```
   
   This avoids the overhead of timezone-aware `DateTime` construction and is 
significantly faster.
   
   ## Important: timezone offset considerations
   
   As noted by @parthchandra in 
https://github.com/apache/datafusion-comet/pull/2996#issuecomment-3725094939, 
the arithmetic fast path is **not timezone-independent for HOUR** (and 
theoretically MINUTE).
   
   The raw arithmetic truncates the **UTC** hour, not the **local** hour. For 
timezones with non-whole-hour offsets, this produces incorrect results. For 
example:
   
   - `Asia/Kathmandu` (UTC+5:45): truncating `14:00 UTC` to HOUR gives `14:00 
UTC` (= 19:45 local), but the correct local-hour truncation would be `19:00 
local` = `13:15 UTC`
   - `Asia/Kolkata` (UTC+5:30), `Australia/Eucla` (UTC+8:45), and others with 
:30/:45 offsets have the same issue
   
   ### Safe fast-path formats
   
   - **MICROSECOND** — always safe (no-op for microsecond timestamps)
   - **MILLISECOND** — always safe (sub-second, no timezone has sub-second 
offset)
   - **SECOND** — always safe (no timezone has sub-second offset)
   - **MINUTE** — safe in practice (no modern timezone has sub-minute offset, 
though some historical ones did)
   - **HOUR** — **NOT safe** for non-whole-hour timezone offsets
   
   ### Suggested approach
   
   One of:
   1. **Only apply arithmetic fast path for SECOND, MILLISECOND, MICROSECOND** 
— safest, still a meaningful optimization
   2. **Apply fast path for HOUR/MINUTE only when the timezone offset is a 
whole-hour/whole-minute multiple** — check the offset at runtime and fall back 
to the slow path otherwise
   3. **Adjust arithmetic to account for timezone offset** — add offset before 
truncation, subtract after. This requires resolving the timezone offset (which 
may vary due to DST), so it may negate the performance benefit for HOUR
   
   The prior PR (#2996) also included optimizations for `date_trunc` (Date32 
truncation) that work directly with days-since-epoch instead of converting 
through `NaiveDateTime`. Those optimizations are not affected by this timezone 
issue and could be submitted separately.
   
   ## Benchmark results (from prior PR #2996)
   
   The arithmetic approach showed significant speedup for sub-day formats:
   
   ```
   Timestamp Truncate - HOUR:        ~2-3x faster
   Timestamp Truncate - MINUTE:      ~2-3x faster  
   Timestamp Truncate - SECOND:      ~2-3x faster
   Timestamp Truncate - MILLISECOND: ~2-3x faster
   Timestamp Truncate - MICROSECOND: ~10x faster (no-op)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to