alamb opened a new issue, #7070:
URL: https://github.com/apache/arrow-datafusion/issues/7070

   ### Is your feature request related to a problem or challenge?
   
   
   In https://github.com/apache/arrow-datafusion/pull/6832 and 
https://github.com/apache/arrow-datafusion/issues/7068 , @tustvold proposes to 
use `Duration` consistently for timestamp arithmetic which makes the code 
simpler and consistent  across DataFusion
   
   However, one implication of this change, at the time of this writigng, is 
shown 
https://github.com/apache/arrow-datafusion/pull/6832#discussion_r1267234606
   
   Specifically, in DataFusion 28.0.0 the output of `timestamp` - `timestamp` 
is a interval and is displayed like this:
   
   ```sql
   
   ❯ select (now() - '2023-01-01'::timestamp);
   +-----------------------------------------------------------+
   | now() - Utf8("2023-01-01")                                |
   +-----------------------------------------------------------+
   | 0 years 0 mons 204 days 16 hours 16 mins 8.121077000 secs |
   +-----------------------------------------------------------+
   1 row in set. Query took 0.004 seconds.
   ```
   
   However, without any additional changes after 
https://github.com/apache/arrow-datafusion/pull/6832 it is displayed in [ISO 
8601 duration](https://en.wikipedia.org/wiki/ISO_8601#Durations) format
   
   ```sql
   ❯ select (now() - '2023-01-01'::timestamp);
   +----------------------------+
   | now() - Utf8("2023-01-01") |
   +----------------------------+
   | P198DT72932.972880S        |
   +----------------------------+
   1 row in set. Query took 0.002 seconds.
   ```
   
   I think this is problematic because the output:
   1. Looks wildly different than 28.0.0
   1. Looks wildly different than how intervals are expressed in queries (e.g. 
`select now() + interval '1 year';`)
   2. Will likely cause significant confusion as the ISO duration format,  
(e.g. `P198DT72932.972880S`), is not widely used for displaying intervals to 
humans
   3. The casting logic to strings will be different (see below)
   
   Casting logic difference will means that the following query in 28.0.0
   
   ```sql
   ❯ select (now() - '2023-01-01'::timestamp)::varchar || ' ago';
   +----------------------------------------------------------------+
   | now() - Utf8("2023-01-01") || Utf8(" ago")                     |
   +----------------------------------------------------------------+
   | 0 years 0 mons 204 days 16 hours 34 mins 15.407810000 secs ago |
   +----------------------------------------------------------------+
   ```
   
   Would produce this in 29.0.0:
   ```sql
   +----------------------------------------------------------------+
   | now() - Utf8("2023-01-01") || Utf8(" ago")                     |
   +----------------------------------------------------------------+
   | P198DT72932.972880S ago |
   +----------------------------------------------------------------+
   ```
   
   
   
   
   ### Describe the solution you'd like
   
   I propose adding specific code for formatting DataFusion output, used in 
datafusion-cli, tests, and sqllogictests that formats Durations consistently 
with how `Interval`s are displayed in 28.0.0.
   
   This change will unblock the arrow upgrade in 
https://github.com/apache/arrow-datafusion/pull/6832 as well as give us a 
single location to control formatting in DataFusion making it easier to improve 
the formatting of intervals / durations (or other types) in the future if we 
desire.
   
   Anyone using DataFusion programmatically can either choose to use this new 
formatting routine, or can format Durations / Intervals to their specific need 
using arrow-rs or custom code (as we do in [IOx, for example 
(source)](https://github.com/influxdata/influxdb_iox/blob/acf9da233699ad7664645c078d83b7bc67e65f8f/arrow_util/src/display.rs#L25-L67))
   
   While this solution still suffers from the `Duration` --> `String` casting 
problem, I think it is ok because I don't think it is very common and there  is 
a workaround (to cast the duration to an interval):
   
   ```
   ❯ select (now() - '2023-01-01'::timestamp)::interval::varchar || ' ago';
   +----------------------------------------------------------------+
   | now() - Utf8("2023-01-01") || Utf8(" ago")                     |
   +----------------------------------------------------------------+
   | 0 years 0 mons 204 days 16 hours 34 mins 15.407810000 secs ago |
   +----------------------------------------------------------------+
   
   ```
   
   
   
   For those people who want even more control of duration printing, it would 
probably make sense to add a specific `to_char` or similar function: 
https://www.postgresql.org/docs/current/functions-formatting.html
   
   
   ### Describe alternatives you've considered
   
   ## Change the parsing / formattting of Durations in arrow-rs
   
   The idea is to make PRs like https://github.com/apache/arrow-rs/pull/4557 
that would change the formatting in arrow-rs
   
   Some challenges with this approach:
   1. It is not clear if there is an "ideal" canonical format is for durations
   2. It would be a backwards compatible change that would likely impact other 
projects than DataFusion
   3. Existing data that was serialized with Durations (e.g. JSON or CSV) may 
be impacted
   
   
   ## Special case String casting / parsing
   
   One way to avoid issues here would be to just special case the casting and 
parsing (`Duration` <==> `String`) in datafusion and instead of calling the 
arrow `cast` function we could call `datafusion_cast` which would have special 
handling for some type and then fall back to the arrow `cast` kernel
   
   The challenges with this approach are:
   1. It would be brittle (it would be hard to ensure that all new code called 
`datafusion_cast` rather than `cast`
   2. It is not clear that it would be entirely consistent (e.g. if there is 
code in arrow itself that calls `cast`)
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to