Hi folks, So currently Arrow Rust/DataFusion supports four types of Timestamp arrays, with Nano, Micro, Millisecond and Second resolution. However, the best supported by far are Nanos. For example, in DataFusion, the following only works for Nanos and not the other resolutions: * CAST(x as TIMESTAMP) -> Nanos only * date_trunc() -> nanos only * filtering a timestamp array, eg my column > to_timestamp(‘2020-06-30T12:00Z’)
In the broader SQL world, in general there seems to be only a single Timestamp type in most databases, though in many cases there is a variable resolution. PostGres’s TIMESTAMP type is microsecond based: https://www.postgresql.org/docs/9.1/datatype-datetime.html <https://www.postgresql.org/docs/9.1/datatype-datetime.html> For UrbanLogiq, in some cases we would like to standardize on millisecond resolution. Nanoseconds yields only ~300 years of span for i64, which isn’t enough for some applications. At minimum, we’d like to get the following working: * Either something like CAST(x AS TIMESTAMP(Milliseconds)) or date_trunc(‘milliseconds’, x) , which can cast different types of timestamp arrays to Timestamp(Milliseconds, None) * filtering that can compare, ideally, different types of timestamp columns to a to_timestamp(….) The last problem is easy to solve as the coercion logic can be fixed to address that, but solving the first problem is not as straightforward. - There isn’t any universal SQL standard for a type that supports different timestamp resolutions - The most non-intrusive way I can think of is: - CAST(x AS TIMESTAMP) -> Nanos - CAST(x AS TIMESTAMP(Micros/Millis/Seconds)) ->. Micros/Millis/Seconds arrays - Functions are designed/work best with a single output type, and date_trunc() is designed to output nanos only. Fixing this would not be trivial, it would probably require changing the signature of return_type() so that return types can be determined from argument values, not just argument types Basically the larger question for the Arrow/DataFusion community is how do we want to deal with supporting different timestamp types. The ideal would be that different functions work on all the timestamp types, but it’ll take a long time to get there I fear. Some possible directions: - Continue current support for nanos as the “first class” citizen, but add support for casting to different timestamp resolutions, and coercion to nanos to work with different functions like date_trunc(). - This means Arrow would not be usable in some cases. Converting from micros to nanos loses year range (for 64-bits anyways) - There is a performance penalty for the coercion - Switch to different basis for the “base” timestamp type, like PostGres did with micros - Use more than 64 bits to represent nanos - Add support for functions to work with different timestamp types - For example, cast() produces different timestamp resolutions - Other date functions like date_trunc() can input from different resolutions - Signatures and return type calculation would be more complex - Switch to a universal timestamp type which supports different resolutions, as some SQL databases support Thanks for your input, Evan