velvia commented on a change in pull request #10005:
URL: https://github.com/apache/arrow/pull/10005#discussion_r613610055
##########
File path: rust/datafusion/src/sql/planner.rs
##########
@@ -1498,6 +1498,7 @@ pub fn convert_data_type(sql: &SQLDataType) ->
Result<DataType> {
SQLDataType::Double => Ok(DataType::Float64),
SQLDataType::Char(_) | SQLDataType::Varchar(_) => Ok(DataType::Utf8),
SQLDataType::Timestamp => Ok(DataType::Timestamp(TimeUnit::Nanosecond,
None)),
+ SQLDataType::Time => Ok(DataType::Timestamp(TimeUnit::Millisecond,
None)),
Review comment:
@alamb and others: so I found that there is already an existing
`date_trunc()` function, which always returns Timestamp(Nanos). The problem
with the above proposal I put up is:
* `return_type()` in functions.rs is designed to give the return type
knowing only the input argument types. IOW, I cannot produce a different type
such as Timestamp(Millis) as I cannot inspect the time granularity value, all I
know is that I have args of type utf8 and timestamp(nanos).
* The current function signature is `Exact` and always takes in inputs of
nanosecond timestamp resolution. I suppose we can keep that... but should
think about what do people do when they have different timestamp resolutions.
Basically, there is a larger issue in Arrow/Datafusion, in that Timestamp
columns with different time resolutions are considered different types, and
most of the infrastructure is designed for or works best when there is a single
output type (and input type), it is much simpler. Also in the SQL world, for
the most part there is just a "Timestamp" type. `date_trunc()` and other
things have been designed to work with nanos, but you can in theory have
timestamp columns which have millis, micros, seconds, but they don't have the
same support that nanos do.
So the big question (and I'm not sure if this PR is the right forum for
this) is which direction should the project be going. There is a mismatch
between SQL support, the design to work with one type, and the Arrow support
for multiple time resolutions.
I can think of some possible directions/solutions, in no particular order:
- Have most functions work on nano resolution timestamps, and have automatic
coercion of other resolutions to nanos when calling functions etc. This
imposes a perf penalty, but maybe an acceptable one.
- Add support for specifying the resolution for timestamps. For example,
`CAST(xx AS TIMESTAMP(Microseconds))` would be possible. There's probably
precent for this somewhere. This would need the previous coercion though to
work.
- Somehow support just a single timestamp type in Arrow, with multiple
possible resolutions. I'm not sure how this would even be represented or
implemented, and it'd be a huge change. It might match better how things are
implemented in other parts of the SQL world though.
Would love everyone's thoughts on this. For now I'd be super happy with
the first two solutions.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]