velvia commented on a change in pull request #10005:
URL: https://github.com/apache/arrow/pull/10005#discussion_r613610055



##########
File path: rust/datafusion/src/sql/planner.rs
##########
@@ -1498,6 +1498,7 @@ pub fn convert_data_type(sql: &SQLDataType) -> 
Result<DataType> {
         SQLDataType::Double => Ok(DataType::Float64),
         SQLDataType::Char(_) | SQLDataType::Varchar(_) => Ok(DataType::Utf8),
         SQLDataType::Timestamp => Ok(DataType::Timestamp(TimeUnit::Nanosecond, 
None)),
+        SQLDataType::Time => Ok(DataType::Timestamp(TimeUnit::Millisecond, 
None)),

Review comment:
       @alamb and others: so I found that there is already an existing 
`date_trunc()` function, which always returns Timestamp(Nanos).  The problem 
with the above proposal I put up is:
   * `return_type()` in functions.rs is designed to give the return type 
knowing only the input argument types.  IOW, I cannot produce a different type 
such as Timestamp(Millis) as I cannot inspect the time granularity value, all I 
know is that I have args of type utf8 and timestamp(nanos).  
   * The current function signature is `Exact` and always takes in inputs of 
nanosecond timestamp resolution.  I suppose we can keep that... but should 
think about what do people do when they have different timestamp resolutions.
   
   Basically, there is a larger issue in Arrow/Datafusion, in that Timestamp 
columns with different time resolutions are considered different types, and 
most of the infrastructure is designed for or works best when there is a single 
output type (and input type), it is much simpler.   Also in the SQL world, for 
the most part there is just a "Timestamp" type.  `date_trunc()` and other 
things have been designed to work with nanos, but you can in theory have 
timestamp columns which have millis, micros, seconds, but they don't have the 
same support that nanos do.
   
   So the big question (and I'm not sure if this PR is the right forum for 
this) is which direction should the project be going.  There is a mismatch 
between SQL support, the design to work with one type, and the Arrow support 
for multiple time resolutions.
   
   I can think of some possible directions/solutions, in no particular order:
   - Have most functions work on nano resolution timestamps, and have automatic 
coercion of other resolutions to nanos when calling functions etc.  This 
imposes a perf penalty, but maybe an acceptable one.
   - Add support for specifying the resolution for timestamps.  For example, 
`CAST(xx AS TIMESTAMP(Microseconds))` would be possible.  There's probably 
precent for this somewhere.   This would need the previous coercion though to 
work.
   - Somehow support just a single timestamp type in Arrow, with multiple 
possible resolutions.   I'm not sure how this would even be represented or 
implemented, and it'd be a huge change.  It might match better how things are 
implemented in other parts of the SQL world though.
   
   Would love everyone's thoughts on this.   For now I'd be super happy with 
the first two solutions.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to