westonpace commented on issue #686:
URL: 
https://github.com/apache/arrow-datafusion/issues/686#issuecomment-874367756


   > I'm not sure I'd call it a "bug" per se, but all of the to_timestamp*() 
functions output TimestampTypes in Arrow that have a Timezone field of None. 
The issues here are:
   
   This does seem like an inconsistency with the just agreed upon convention.  
I would expect all of these methods to convert to `Timestamp(unit, "UTC")`.
   
   > I'm aware there was a recent vote to treat the None timestamp like a local 
timestamp, but this isn't always the case either. For example, we have been 
using the output of None timestamps from DataFusion but we treat all our 
timestamps as UTC internally.
   
   Correct, I think DataFusion is interpreting `Timestamp(unit, None)` as 
"instant".  This is still valid, but no longer "conventional" and could cause 
interoperability problems.  That being said, "local" timestamps don't often get 
stored or travel between implementations.  As long as DataFusion treats both 
`Timestamp(unit, None)` and `Timestamp(unit, UTC)` the same then I wouldn't say 
it is urgent or must-fix.  It's probably a topic for discussion within DF.
   
   > Timestamp types with different timezones are not comparable / compatible 
types. For example, if I have data with a timezone of Some("UTC") the following 
fails due to incompatible types: WHERE timestamp_col > 
to_timestamp("2021-06-21T12:00Z") (because timestamp_col has Some(UTC) but 
to_timestamp returns None)
   
   I'm not sure if you are saying the are not comparable or they should not be 
comparable.  They "should" be comparable if you are comparing two columns with 
a non-None timestamp (e.g. `Timestamp(units, "UTC")` should be comparable with 
`Timestamp(units, "+0700")` but neither should compare to `Timestamp(units, 
None)`.  This is going by the newly discussed "conventional" logic where `None` 
is "local" and `UTC` is "instant.  Since DF has it backwards right now you'd 
need to invert the logic.
   
   > The best solution I can think of would be for to_timestamp(...) to support 
a second, optional argument where the timezone can be specified.
   
   As long as it is optional that is probably a good idea.  However, that 
sounds like it is adding capability and wouldn't fix anything on its own.
   
   > There is also the possibility to simplify things and eliminate the 
timezone field from Arrow, but that's a really big change. For example, Parquet 
does not store the timezone but simply has a isAdjustedToUTC field. If True, 
that means timestamps are UTC, and if not that means local timestamp.
   
   I'm not sure exactly what you are saying here.  `isAdjustedToUTC` basically 
boils down to `timezone != null` (I might have that backwards).  If you are 
arguing to avoid using timestamps other than `UTC` or `null` then I agree that 
is a viable option.  Implementations can always store timestamp info in a 
separate column or always represent "zoned date times" as structs (given that 
the only reason to use "zoned date time" in the first place is typically for 
field extraction).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to