devanbenz commented on issue #5827:
URL: https://github.com/apache/arrow-rs/issues/5827#issuecomment-2346920702

   In lieu of https://github.com/apache/datafusion/issues/12218 I'm going to 
bring this thread back from the dead πŸ‘».  I'll outline my understanding of the 
problem as well as what I understand within the current Apache Arrow 
specification and offer a potential solution. 
   
   The TL;DR is that we change nothing since we are in line with the current 
Apache Arrow expectations and datafusion functions similar to a different 
system that uses Arrow--Clickhouse it is consistent. This proposal is under the 
assumption that we desire to be completely in line with the current Apache 
Arrow specification though. 
   
   ## The problem and potential slated proposal
   
   There currently is a misalignment with how casting from 
`Timestamp('time/zone') -> Timestamp(None)` functions. There is an expectation 
that this cast should effectively "retain" the timezone, for example:
   
   `2033-05-18T08:33:20+01:00` -> `2033-05-18T08:33:20`
   
   As of right now the [arrow](https://github.com/apache/arrow) spec 
[here](https://github.com/apache/arrow/blob/0c2891da304e6740abf3b5d3d3c2b88b10b5aba6/format/Schema.fbs#L324)
 has indicated that the current state of casting should be like so:
   
   `2033-05-18T08:33:20+01:00` -> `2033-05-18T07:33:20`
   
   Currently there is a proposal that when casting a ISO style timestamp with a 
timezone to the `Timestamp` data type it should not adjust the actual date 
time. 
   
   @tustvold shared this code snippet that outlines the expectation/proposal 
pretty well. 
   ```
   let a = StringArray::from_iter_values([
       "2033-05-18T08:33:20",
       "2033-05-18T08:33:20Z",
       "2033-05-18T08:33:20 +01:00",
   ]);
   
   let no_timezone = cast(&a, &DataType::Timestamp(TimeUnit::Nanosecond, 
None)).unwrap();
   let back = cast(&no_timezone, &DataType::Utf8).unwrap();
   assert_eq!(
       back.as_string::<i32>(),
       &StringArray::from_iter_values([
           "2033-05-18T08:33:20",
           "2033-05-18T08:33:20",
           "2033-05-18T07:33:20" <---- this issue is proposing changing this to 
2033-05-18T08:33:20
       ])
   );
   ```
   
   ## External research
   
   *Per @Omega359's comment*
   > Has anyone looked into what the other arrow implementations do to handle 
this ambiguous portion of the spec?
   
   Currently when running the following query in `Clickhouse` which uses 
`arrow` the outcome is similar to how `arrow-rs` is treating this cast (more 
below on how clickhouse can handle it):
   
   In clickhouse (arrow):
   ```
   :) SELECT toTimeZone('2024-09-10 11:45:19'::timestamp, 'America/Denver');
   
      β”Œβ”€toTimeZone(CAST('2024-09-10 11:45:19', 'timestamp'), 'America/Denver')─┐
   1. β”‚                                                    2024-09-10 10:45:19 β”‚
      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
   
   :) SELECT toTimeZone('2024-09-10 11:45:19'::timestamp, 
'America/Denver')::timestamp;
   
      β”Œβ”€CAST(toTimeZone(CAST('2024-09-10 11:45:19', 'timestamp'), 
'America/Denver'), 'timestamp')─┐
   1. β”‚                                                                       
2024-09-10 11:45:19 β”‚
      
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
   ```
   
   In datafusion (arrow-rs):
   ```
   > select ('2024-09-10 11:45:19' AT TIME ZONE 'America/Denver');
   +-----------------------------+
   | Utf8("2024-09-10 11:45:19") |
   +-----------------------------+
   | 2024-09-10T11:45:19-06:00   |
   +-----------------------------+
   1 row(s) fetched.
   Elapsed 0.048 seconds.
   
   > select ('2024-09-10 11:45:19' AT TIME ZONE 'America/Denver')::timestamp;
   +-----------------------------+
   | Utf8("2024-09-10 11:45:19") |
   +-----------------------------+
   | 2024-09-10T17:45:19         |
   +-----------------------------+
   1 row(s) fetched.
   Elapsed 0.006 seconds.
   ```
   *Note: Datafusion uses UTC as the local time (more on this later).*
   
   
   When taking a look at Postgres I see the *opposite* functionality:
   ```
   postgres=# SELECT ('2024-09-10 11:45:19' AT TIME ZONE 'America/Denver');
         timezone
   ---------------------
    2024-09-10 10:45:19
   (1 row)
   
   postgres=# SELECT ('2024-09-10 11:45:19' AT TIME ZONE 
'America/Denver')::timestamp;
         timezone
   ---------------------
    2024-09-10 10:45:19
   (1 row)
   ```
   The timezone is effectively retained. 
   
   ---
   I filed a ticket within clickhouse to ask about this functionality: 
https://github.com/ClickHouse/ClickHouse/issues/69512 
   
   I was met with the following comment:
   
   > To clarify the current behaviour: The types DateTime <> 
DateTime('America/Denver'). It's not stripping, it kinda cast to another type.
   
   There are a few tickets I've seen scattered throughout Arrow's issues that 
sort of relate to this functionality (as far as I'm aware): 
   - https://github.com/apache/arrow/issues/41382
   - https://github.com/apache/arrow/issues/41268
   
   Most of them are closed with no indication that this will be adjusted. 
   
   ## Proposal 
   
   Even though I personally think that the way Postgres does this is "more 
correct" that is just a matter of opinion. For consistency sake it would be 
best to keep the functionality as is so it is in line with Arrow's spec. If 
someone were to be a consumer of a different product using arrow and migrating 
over/using a product that uses arrow-rs it would likely make the most sense to 
have the same functionality.
   
   I've also added a proposal to the original issue that led me to this thread: 
https://github.com/apache/datafusion/issues/12218


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to