andygrove opened a new issue, #3477: URL: https://github.com/apache/datafusion-comet/issues/3477
## Description Timestamp truncation (`date_trunc` / `TruncTimestamp`) for sub-day formats (HOUR, MINUTE, SECOND, MILLISECOND, MICROSECOND) can be optimized using simple modular arithmetic instead of the current approach that converts to/from `DateTime<Tz>` via chrono. Since timestamps are stored internally as microseconds since Unix epoch (UTC), truncation to e.g. SECOND can be computed as: ```rust micros - (micros % MICROS_PER_SECOND) ``` This avoids the overhead of timezone-aware `DateTime` construction and is significantly faster. ## Important: timezone offset considerations As noted by @parthchandra in https://github.com/apache/datafusion-comet/pull/2996#issuecomment-3725094939, the arithmetic fast path is **not timezone-independent for HOUR** (and theoretically MINUTE). The raw arithmetic truncates the **UTC** hour, not the **local** hour. For timezones with non-whole-hour offsets, this produces incorrect results. For example: - `Asia/Kathmandu` (UTC+5:45): truncating `14:00 UTC` to HOUR gives `14:00 UTC` (= 19:45 local), but the correct local-hour truncation would be `19:00 local` = `13:15 UTC` - `Asia/Kolkata` (UTC+5:30), `Australia/Eucla` (UTC+8:45), and others with :30/:45 offsets have the same issue ### Safe fast-path formats - **MICROSECOND** — always safe (no-op for microsecond timestamps) - **MILLISECOND** — always safe (sub-second, no timezone has sub-second offset) - **SECOND** — always safe (no timezone has sub-second offset) - **MINUTE** — safe in practice (no modern timezone has sub-minute offset, though some historical ones did) - **HOUR** — **NOT safe** for non-whole-hour timezone offsets ### Suggested approach One of: 1. **Only apply arithmetic fast path for SECOND, MILLISECOND, MICROSECOND** — safest, still a meaningful optimization 2. **Apply fast path for HOUR/MINUTE only when the timezone offset is a whole-hour/whole-minute multiple** — check the offset at runtime and fall back to the slow path otherwise 3. **Adjust arithmetic to account for timezone offset** — add offset before truncation, subtract after. This requires resolving the timezone offset (which may vary due to DST), so it may negate the performance benefit for HOUR The prior PR (#2996) also included optimizations for `date_trunc` (Date32 truncation) that work directly with days-since-epoch instead of converting through `NaiveDateTime`. Those optimizations are not affected by this timezone issue and could be submitted separately. ## Benchmark results (from prior PR #2996) The arithmetic approach showed significant speedup for sub-day formats: ``` Timestamp Truncate - HOUR: ~2-3x faster Timestamp Truncate - MINUTE: ~2-3x faster Timestamp Truncate - SECOND: ~2-3x faster Timestamp Truncate - MILLISECOND: ~2-3x faster Timestamp Truncate - MICROSECOND: ~10x faster (no-op) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
