devanbenz commented on issue #5827: URL: https://github.com/apache/arrow-rs/issues/5827#issuecomment-2346920702
In lieu of https://github.com/apache/datafusion/issues/12218 I'm going to bring this thread back from the dead π». I'll outline my understanding of the problem as well as what I understand within the current Apache Arrow specification and offer a potential solution. The TL;DR is that we change nothing since we are in line with the current Apache Arrow expectations and datafusion functions similar to a different system that uses Arrow--Clickhouse it is consistent. This proposal is under the assumption that we desire to be completely in line with the current Apache Arrow specification though. ## The problem and potential slated proposal There currently is a misalignment with how casting from `Timestamp('time/zone') -> Timestamp(None)` functions. There is an expectation that this cast should effectively "retain" the timezone, for example: `2033-05-18T08:33:20+01:00` -> `2033-05-18T08:33:20` As of right now the [arrow](https://github.com/apache/arrow) spec [here](https://github.com/apache/arrow/blob/0c2891da304e6740abf3b5d3d3c2b88b10b5aba6/format/Schema.fbs#L324) has indicated that the current state of casting should be like so: `2033-05-18T08:33:20+01:00` -> `2033-05-18T07:33:20` Currently there is a proposal that when casting a ISO style timestamp with a timezone to the `Timestamp` data type it should not adjust the actual date time. @tustvold shared this code snippet that outlines the expectation/proposal pretty well. ``` let a = StringArray::from_iter_values([ "2033-05-18T08:33:20", "2033-05-18T08:33:20Z", "2033-05-18T08:33:20 +01:00", ]); let no_timezone = cast(&a, &DataType::Timestamp(TimeUnit::Nanosecond, None)).unwrap(); let back = cast(&no_timezone, &DataType::Utf8).unwrap(); assert_eq!( back.as_string::<i32>(), &StringArray::from_iter_values([ "2033-05-18T08:33:20", "2033-05-18T08:33:20", "2033-05-18T07:33:20" <---- this issue is proposing changing this to 2033-05-18T08:33:20 ]) ); ``` ## External research *Per @Omega359's comment* > Has anyone looked into what the other arrow implementations do to handle this ambiguous portion of the spec? Currently when running the following query in `Clickhouse` which uses `arrow` the outcome is similar to how `arrow-rs` is treating this cast (more below on how clickhouse can handle it): In clickhouse (arrow): ``` :) SELECT toTimeZone('2024-09-10 11:45:19'::timestamp, 'America/Denver'); ββtoTimeZone(CAST('2024-09-10 11:45:19', 'timestamp'), 'America/Denver')ββ 1. β 2024-09-10 10:45:19 β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ :) SELECT toTimeZone('2024-09-10 11:45:19'::timestamp, 'America/Denver')::timestamp; ββCAST(toTimeZone(CAST('2024-09-10 11:45:19', 'timestamp'), 'America/Denver'), 'timestamp')ββ 1. β 2024-09-10 11:45:19 β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ ``` In datafusion (arrow-rs): ``` > select ('2024-09-10 11:45:19' AT TIME ZONE 'America/Denver'); +-----------------------------+ | Utf8("2024-09-10 11:45:19") | +-----------------------------+ | 2024-09-10T11:45:19-06:00 | +-----------------------------+ 1 row(s) fetched. Elapsed 0.048 seconds. > select ('2024-09-10 11:45:19' AT TIME ZONE 'America/Denver')::timestamp; +-----------------------------+ | Utf8("2024-09-10 11:45:19") | +-----------------------------+ | 2024-09-10T17:45:19 | +-----------------------------+ 1 row(s) fetched. Elapsed 0.006 seconds. ``` *Note: Datafusion uses UTC as the local time (more on this later).* When taking a look at Postgres I see the *opposite* functionality: ``` postgres=# SELECT ('2024-09-10 11:45:19' AT TIME ZONE 'America/Denver'); timezone --------------------- 2024-09-10 10:45:19 (1 row) postgres=# SELECT ('2024-09-10 11:45:19' AT TIME ZONE 'America/Denver')::timestamp; timezone --------------------- 2024-09-10 10:45:19 (1 row) ``` The timezone is effectively retained. --- I filed a ticket within clickhouse to ask about this functionality: https://github.com/ClickHouse/ClickHouse/issues/69512 I was met with the following comment: > To clarify the current behaviour: The types DateTime <> DateTime('America/Denver'). It's not stripping, it kinda cast to another type. There are a few tickets I've seen scattered throughout Arrow's issues that sort of relate to this functionality (as far as I'm aware): - https://github.com/apache/arrow/issues/41382 - https://github.com/apache/arrow/issues/41268 Most of them are closed with no indication that this will be adjusted. ## Proposal Even though I personally think that the way Postgres does this is "more correct" that is just a matter of opinion. For consistency sake it would be best to keep the functionality as is so it is in line with Arrow's spec. If someone were to be a consumer of a different product using arrow and migrating over/using a product that uses arrow-rs it would likely make the most sense to have the same functionality. I've also added a proposal to the original issue that led me to this thread: https://github.com/apache/datafusion/issues/12218 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
