Hi Evan, Thank you for writing this up. For anyone else following along, there is more context / background on [1]
I personally like the "- Add support for functions to work with different timestamp types" option; In my mind this would effectively mean promoting the other timestamp types to be "first class" citizens in DataFusion. The biggest challenge, I think, is, as you mention, is "how do you specify what the timestamp resolution to use". I can think of these areas: 1. For TableProviders this can already be done in terms of `Schema::DataType` so no changes are needed there 2. For Flight and Parquet, the necessary types are already in place, so no changes are needed 3. For functions (e.g. date_trunc(...)), I think the infrastructure for multiple type signatures exists, we just need implementations for different resolutions 4. For converting back and forth from different resolutions in SQL Server For 3, given there is seemingly no SQL standard way to do this, I suggest adding a new, non-postgres function like `convert_resolution` that performs the desired conversion. This function could be implemented today as a user defined function (UDF), and/or included as a built in function in DataFusion in the future. Andrew [1] https://github.com/apache/arrow/pull/10005#discussion_r612551640 On Thu, Apr 15, 2021 at 4:41 PM Evan Chan <e...@urbanlogiq.com> wrote: > Hi folks, > > So currently Arrow Rust/DataFusion supports four types of Timestamp > arrays, with Nano, Micro, Millisecond and Second resolution. However, the > best supported by far are Nanos. For example, in DataFusion, the following > only works for Nanos and not the other resolutions: > * CAST(x as TIMESTAMP) -> Nanos only > * date_trunc() -> nanos only > * filtering a timestamp array, eg my column > > to_timestamp(‘2020-06-30T12:00Z’) > > In the broader SQL world, in general there seems to be only a single > Timestamp type in most databases, though in many cases there is a variable > resolution. > PostGres’s TIMESTAMP type is microsecond based: > https://www.postgresql.org/docs/9.1/datatype-datetime.html < > https://www.postgresql.org/docs/9.1/datatype-datetime.html> > > For UrbanLogiq, in some cases we would like to standardize on millisecond > resolution. Nanoseconds yields only ~300 years of span for i64, which > isn’t enough for some applications. At minimum, we’d like to get the > following working: > * Either something like CAST(x AS TIMESTAMP(Milliseconds)) or > date_trunc(‘milliseconds’, x) , which can cast different types of timestamp > arrays to Timestamp(Milliseconds, None) > * filtering that can compare, ideally, different types of timestamp > columns to a to_timestamp(….) > > The last problem is easy to solve as the coercion logic can be fixed to > address that, but solving the first problem is not as straightforward. > - There isn’t any universal SQL standard for a type that supports > different timestamp resolutions > - The most non-intrusive way I can think of is: > - CAST(x AS TIMESTAMP) -> Nanos > - CAST(x AS TIMESTAMP(Micros/Millis/Seconds)) ->. > Micros/Millis/Seconds arrays > - Functions are designed/work best with a single output type, and > date_trunc() is designed to output nanos only. Fixing this would not be > trivial, it would probably require changing the signature of return_type() > so that return types can be determined from argument values, not just > argument types > > Basically the larger question for the Arrow/DataFusion community is how do > we want to deal with supporting different timestamp types. The ideal would > be that different functions work on all the timestamp types, but it’ll take > a long time to get there I fear. Some possible directions: > > - Continue current support for nanos as the “first class” citizen, but add > support for casting to different timestamp resolutions, and coercion to > nanos to work with different functions like date_trunc(). > - This means Arrow would not be usable in some cases. Converting from > micros to nanos loses year range (for 64-bits anyways) > - There is a performance penalty for the coercion > - Switch to different basis for the “base” timestamp type, like PostGres > did with micros > - Use more than 64 bits to represent nanos > - Add support for functions to work with different timestamp types > - For example, cast() produces different timestamp resolutions > - Other date functions like date_trunc() can input from different > resolutions > - Signatures and return type calculation would be more complex > - Switch to a universal timestamp type which supports different > resolutions, as some SQL databases support > > Thanks for your input, > Evan