Hi folks,

So currently Arrow Rust/DataFusion supports four types of Timestamp arrays, 
with Nano, Micro, Millisecond and Second resolution.  However, the best 
supported by far are Nanos.  For example, in DataFusion, the following only 
works for Nanos and not the other resolutions:
* CAST(x as TIMESTAMP)  -> Nanos only
* date_trunc()  -> nanos only
* filtering a timestamp array, eg my column > to_timestamp(‘2020-06-30T12:00Z’)

In the broader SQL world, in general there seems to be only a single Timestamp 
type in most databases, though in many cases there is a variable resolution.  
PostGres’s TIMESTAMP type is microsecond based:  
https://www.postgresql.org/docs/9.1/datatype-datetime.html 
<https://www.postgresql.org/docs/9.1/datatype-datetime.html>

For UrbanLogiq, in some cases we would like to standardize on millisecond 
resolution.  Nanoseconds yields only ~300 years of span for i64, which isn’t 
enough for some applications.  At minimum, we’d like to get the following 
working:
* Either something like CAST(x AS TIMESTAMP(Milliseconds)) or 
date_trunc(‘milliseconds’, x) , which can cast different types of timestamp 
arrays to Timestamp(Milliseconds, None)
* filtering that can compare, ideally, different types of timestamp columns to 
a to_timestamp(….)

The last problem is easy to solve as the coercion logic can be fixed to address 
that, but solving the first problem is not as straightforward.
- There isn’t any universal SQL standard for a type that supports different 
timestamp resolutions
- The most non-intrusive way I can think of is:
    - CAST(x AS TIMESTAMP)   -> Nanos
    - CAST(x AS TIMESTAMP(Micros/Millis/Seconds)) ->. Micros/Millis/Seconds 
arrays
- Functions are designed/work best with a single output type, and date_trunc() 
is designed to output nanos only.  Fixing this would not be trivial, it would 
probably require changing the signature of return_type() so that return types 
can be determined from argument values, not just argument types

Basically the larger question for the Arrow/DataFusion community is how do we 
want to deal with supporting different timestamp types.  The ideal would be 
that different functions work on all the timestamp types, but it’ll take a long 
time to get there I fear.  Some possible directions:

- Continue current support for nanos as the “first class” citizen, but add 
support for casting to different timestamp resolutions, and coercion to nanos 
to work with different functions like date_trunc().
    - This means Arrow would not be usable in some cases.  Converting from 
micros to nanos loses year range (for 64-bits anyways)
    - There is a performance penalty for the coercion
- Switch to different basis for the “base” timestamp type, like PostGres did 
with micros
- Use more than 64 bits to represent nanos
- Add support for functions to work with different timestamp types
    - For example, cast() produces different timestamp resolutions
    - Other date functions like date_trunc() can input from different 
resolutions
    - Signatures and return type calculation would be more complex
- Switch to a universal timestamp type which supports different resolutions, as 
some SQL databases support

Thanks for your input,
Evan

Reply via email to