waitingkuo opened a new issue, #3100:
URL: https://github.com/apache/arrow-datafusion/issues/3100

   
   
   # !!! Please correct me if i'm wrong !!!
   
   # Intro
   
   - This changes breaks some original behaviors.
   - UTC+8 is the local time zone in my examples. 
   - As our goal is to be Postgresql compatible, this proposal is mainly 
following
     - [PostgreSQL Date/Time 
Types](https://www.postgresql.org/docs/14/datatype-datetime.html)
     - [Arrow-rs](https://github.com/apache/arrow-rs)
   - I plan to do [Data Type Formatting 
Function](https://www.postgresql.org/docs/14/functions-formatting.html) and 
[Date/Time Functions and 
Operators](https://www.postgresql.org/docs/14/functions-datetime.html) after 
most of the items here are finalized
   
   
   # Design Principle
   
   1. align with postgresql in most cases
   2. use nanoseconds for the default timestamp resolution (unlike postgresql 
and pyspark's microseconds)
   3. use utc+0 as default timezone for  timestamp with time zone
   
   
   # Let's Begin with Postgresql's Date/Time
   
![image](https://user-images.githubusercontent.com/1100923/184016648-0caf5e6b-3084-4e53-9fed-bd225ec027f9.png)
   
   # Let's Start to Compare
   
   <hr>
   
   ## Timestamp
   
   #### Postgresql
   - uses 8 bytes for both `timestamp` and `timestamp with time zone`. (note 
that time zone is included in these 8 bytes)
   - uses microsecond as resolution, which is the number of microseconds from 
`1970-01-01T00:00:00`
   - has 7 kinds of floating point precision, from `timestamp(0)` to 
`timestamp(6)`
     - `timestamp(0)` rounds to seconds
     - `timestamp(3)` rounds to milliseconds
     - `timestamp(6)` rounds to microseconds 
   - `timestamp 'xxx'` output `timestamp`
   1. if `xxx` does't contain time zone info, it just works as what you think
   
   ```bash
   willy=# select timestamp '2000-01-01T00:00:00';
         timestamp      
   ---------------------
    2000-01-01 00:00:00
   (1 row)
   ```  
   2. if `xxx` contains time zone info, time zone is just ignored. (i believe 
that this is a surprise for some people) e.g. 
   ```bash
   willy=# select timestamp '2000-01-01T00:00:00+08:00';
         timestamp      
   ---------------------
    2000-01-01 00:00:00
   (1 row)
   ```
   
   - `timestamp with time zone 'xxx'` output `timestamp with time`
   1 if `xxx` contains no time zone, it assume it's local time
   ```bash
   willy=# select timestamp with time zone '2000-01-01T00:00:00';
         timestamptz       
   ------------------------
    2000-01-01 00:00:00+08
   (1 row)
   ```
     
   2 if `xxx` contains time zone, it'll be converted to your local time zone
   ```bash
   willy=# select timestamp with time zone '2000-01-01T00:00:00+02:00';
         timestamptz       
   ------------------------
    2000-01-01 06:00:00+08
   (1 row)
   ```
   
   #### Datafusion
   - `Timestamp(TimeUnit, Option<String>)`
     - we have 
       - `TimeUnit::Second`
       - `TimeUnit::MilliSecond`
       - `TImeUnit::MicroSecond`
       - `TimeUnit::NanoSecond` 
     - which store number of seconds/millis/micros/nanos from 
`1970-01-01T00:00:00`
     - most of the timestamp related functions output 
`Timestamp(TimeUnit::NanoSecond, None)`
   - We only have `timestamp` literal but no `timestamp with time zone`
   - `timestamp xxx` outputs `Timestamp(TimeUnit::NanoSecond, None)`
   1. if `xxx` contains no time zone, it automatically applies local time, 
parse it, convert it to utc time zone, and then drop the time zone #3080
   ```bash
   ❯ select cast('2000-01-01T00:00:00' as timestamp);
   +------------------------------------------------------------------+
   | CAST(Utf8("2000-01-01T00:00:00") AS Timestamp(Nanosecond, None)) |
   +------------------------------------------------------------------+
   | 1999-12-31 16:00:00                                              |
   +------------------------------------------------------------------+
   1 row in set. Query took 0.004 seconds.
   ```
   2. if `xxx` contains time zone, it's parsed correctly, then converted to utc 
time zone, and then drop the time zone
     ```
     ❯ select timestamp '2000-01-01T00:00:00+02:00';
   +------------------------------------------------------------------------+
   | CAST(Utf8("2000-01-01T00:00:00+02:00") AS Timestamp(Nanosecond, None)) |
   +------------------------------------------------------------------------+
   | 1999-12-31 22:00:00                                                    |
   +------------------------------------------------------------------------+
   1 row in set. Query took 0.000 seconds.
      ```
   
   #### Proposal
   
   - make `timestamp xxx` work like postgresql does
   - add `timestamp with time zone`, i believe there're lots of works and 
discussions to do: apache/arrow-rs#1936 apache/arrow-rs#1380 
apache/arrow-rs#597 
   - make the local time zone as UTC by default (timestamp is hard, i think 
this could prevent some ambiguities)
   - add `set time zone to xxx` to change the local time zone
   
   <hr>
   
   ## Date
   
   postgresql image
   
   #### Posgresql 
   
   - supported these formats
   
![image](https://user-images.githubusercontent.com/1100923/184018775-a5ee456f-3aec-4f3d-b10e-9c3adf104036.png)
   
   
   #### Datafusion
   
   - among all the format above, we only support the first `1999-01-08`
   
   
   #### Proposal
   - I don't think there're any issues for `Date`
   - We could consider add another 2 ISO 8601 formats (i.e. `19990108` and 
`990108`) [Chrono](https://github.com/chronotope/chrono) strictly follows ISO 
8601. I think supporting all 8601 date formats makes sense.
   
   <hr>
   
   ## Time
   
   #### Postgresql
   
   - `time xxx` output `time` that requires 8 bytes
   - `time xxx with time zone` that requires 12 bytes, I have no idea why we 
need 4 more bytes here since `timestamp with time zone` only requires 8 bytes
   
   #### Datafusion
   
   - We don't have `time` literal for now, let's wait for #3010
   
   #### Proposal
   
   - I personally never used `time with time zone`. I have no clue when we need 
it. `arrow-rs`'s `time` datatype contains no timezone. Perhaps we need not to 
implement this.
   - let's wait for #3010
   
   <hr>
   
   ## Interval
   
   #### Postgresql
   
   - requires 16 bytes
   - resolution as microseconds
   - the outputs for different operators & inputs
   
![image](https://user-images.githubusercontent.com/1100923/184020230-ec8b2e7a-b37a-41f2-ad47-52d56c02e771.png)
   
![image](https://user-images.githubusercontent.com/1100923/184020334-05548a79-891f-41ac-a1e2-0813df8e3d73.png)
   
   
   
   #### Datafusion
   
   reference: 
<https://github.com/apache/arrow-rs/blob/master/arrow/src/datatypes/datatype.rs#L237>
   
   - `Interval(IntervalUnit)`
   - we have following units
     - `IntervalUnit::YearMonth`
       - number of months
       - stored as 32-bit integer
     - `IntervalUnit::DayTime`
       - stored as 2 contiguous 32-bit integers (days, millisseconds), 8 bytes 
in total
     - `IntervalUnit::MonthDayNano`
     - a triple of the number of (months, days, nanoseconds)
       - month is stored as 32-bit integers
       - day is stored as 32-bit integers
       - nanosecond is stored as 64 bit integers
       - 16 bytes in total
   - `interval xxx` output `Interval(DayTime)`
   ```bash
   ❯ select interval '1 hour 1 second';
   +------------------------------------------------+
   | IntervalDayTime("3601000")                     |
   +------------------------------------------------+
   | 0 years 0 mons 0 days 1 hours 0 mins 1.00 secs |
   +------------------------------------------------+
   1 row in set. Query took 0.000 seconds.
   ```
   - `interval xxx` support floating number seconds
   ```bash
   ❯ select interval '0.1 second';
   +-------------------------------------------------+
   | IntervalDayTime("100")                          |
   +-------------------------------------------------+
   | 0 years 0 mons 0 days 0 hours 0 mins 0.100 secs |
   +-------------------------------------------------+
   1 row in set. Query took 0.000 seconds.
   ```
   - if it's less than microsecond, it'll truncated
   ```bash
   ❯ select interval '0.0001 second';
   +------------------------------------------------+
   | IntervalDayTime("0")                           |
   +------------------------------------------------+
   | 0 years 0 mons 0 days 0 hours 0 mins 0.00 secs |
   +------------------------------------------------+
   1 row in set. Query took 0.000 seconds.
   ```
   - we cannot add `interval(DayTime)` to `Timestamp(NanoSecond, None)`, 
perhaps the reason here is the difference of resolution
   ```bash
   ❯ select timestamp '2000-01-01Z' + interval '1 day';
   Plan("'Timestamp(Nanosecond, None) + Interval(DayTime)' can't be evaluated 
because there isn't a common type to coerce the types to")
   ```
   - we can add `interval(DayTime)` to `Date`
   ```bash
   ❯ select DATE '2000-01-01' + INTERVAL '1 day';
   +--------------------------------------------------------------------+
   | CAST(Utf8("2000-01-01") AS Date32) + IntervalDayTime("4294967296") |
   +--------------------------------------------------------------------+
   | 2000-01-01                                                         |
   +--------------------------------------------------------------------+
   1 row in set. Query took 0.000 seconds.
   ```
   - it breaks while we have hour (or other smaller units) interval #3093
   ```bash
   ❯ select DATE '2000-01-01' + INTERVAL '1 hour';
   +-----------------------------------------------------------------+
   | CAST(Utf8("2000-01-01") AS Date32) + IntervalDayTime("3600000") |
   +-----------------------------------------------------------------+
   | +11856-06-19                                                    |
   +-----------------------------------------------------------------+
   1 row in set. Query took 0.000 seconds.
   ```
   - We don't have `Time` now, let's wait for #3010
   
   
   #### Proposal
   
   - Consider make `INTERVAL xxx` outputs Interval(MonthDayNano) instead of 
Interval(DayTime) as it's easier to align with our `Timestamp(NanoSecond, None)`
   - Carefully design the outputs for operators like what postgresql has
   - we could think about whether we really need `timestamp with time zone - 
timestamp` ... this is what postgresql has
   ```bash
   willy=# select timestamp with time zone '2000-01-01T00:00:00Z' - timestamp 
'2000-01-01T00:00:00';
    ?column? 
   ----------
    08:00:00
   (1 row)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to