velvia commented on a change in pull request #10005: URL: https://github.com/apache/arrow/pull/10005#discussion_r613610055
########## File path: rust/datafusion/src/sql/planner.rs ########## @@ -1498,6 +1498,7 @@ pub fn convert_data_type(sql: &SQLDataType) -> Result<DataType> { SQLDataType::Double => Ok(DataType::Float64), SQLDataType::Char(_) | SQLDataType::Varchar(_) => Ok(DataType::Utf8), SQLDataType::Timestamp => Ok(DataType::Timestamp(TimeUnit::Nanosecond, None)), + SQLDataType::Time => Ok(DataType::Timestamp(TimeUnit::Millisecond, None)), Review comment: @alamb and others: so I found that there is already an existing `date_trunc()` function, which always returns Timestamp(Nanos). The problem with the above proposal I put up is: * `return_type()` in functions.rs is designed to give the return type knowing only the input argument types. IOW, I cannot produce a different type such as Timestamp(Millis) as I cannot inspect the time granularity value, all I know is that I have args of type utf8 and timestamp(nanos). * The current function signature is `Exact` and always takes in inputs of nanosecond timestamp resolution. I suppose we can keep that... but should think about what do people do when they have different timestamp resolutions. Basically, there is a larger issue in Arrow/Datafusion, in that Timestamp columns with different time resolutions are considered different types, and most of the infrastructure is designed for or works best when there is a single output type (and input type), it is much simpler. Also in the SQL world, for the most part there is just a "Timestamp" type. `date_trunc()` and other things have been designed to work with nanos, but you can in theory have timestamp columns which have millis, micros, seconds, but they don't have the same support that nanos do. So the big question (and I'm not sure if this PR is the right forum for this) is which direction should the project be going. There is a mismatch between SQL support, the design to work with one type, and the Arrow support for multiple time resolutions. I can think of some possible directions/solutions, in no particular order: - Have most functions work on nano resolution timestamps, and have automatic coercion of other resolutions to nanos when calling functions etc. This imposes a perf penalty, but maybe an acceptable one. - Add support for specifying the resolution for timestamps. For example, `CAST(xx AS TIMESTAMP(Microseconds))` would be possible. There's probably precent for this somewhere. This would need the previous coercion though to work. - Somehow support just a single timestamp type in Arrow, with multiple possible resolutions. I'm not sure how this would even be represented or implemented, and it'd be a huge change. It might match better how things are implemented in other parts of the SQL world though. Would love everyone's thoughts on this. For now I'd be super happy with the first two solutions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org