westonpace commented on issue #686: URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-874367756
> I'm not sure I'd call it a "bug" per se, but all of the to_timestamp*() functions output TimestampTypes in Arrow that have a Timezone field of None. The issues here are: This does seem like an inconsistency with the just agreed upon convention. I would expect all of these methods to convert to `Timestamp(unit, "UTC")`. > I'm aware there was a recent vote to treat the None timestamp like a local timestamp, but this isn't always the case either. For example, we have been using the output of None timestamps from DataFusion but we treat all our timestamps as UTC internally. Correct, I think DataFusion is interpreting `Timestamp(unit, None)` as "instant". This is still valid, but no longer "conventional" and could cause interoperability problems. That being said, "local" timestamps don't often get stored or travel between implementations. As long as DataFusion treats both `Timestamp(unit, None)` and `Timestamp(unit, UTC)` the same then I wouldn't say it is urgent or must-fix. It's probably a topic for discussion within DF. > Timestamp types with different timezones are not comparable / compatible types. For example, if I have data with a timezone of Some("UTC") the following fails due to incompatible types: WHERE timestamp_col > to_timestamp("2021-06-21T12:00Z") (because timestamp_col has Some(UTC) but to_timestamp returns None) I'm not sure if you are saying the are not comparable or they should not be comparable. They "should" be comparable if you are comparing two columns with a non-None timestamp (e.g. `Timestamp(units, "UTC")` should be comparable with `Timestamp(units, "+0700")` but neither should compare to `Timestamp(units, None)`. This is going by the newly discussed "conventional" logic where `None` is "local" and `UTC` is "instant. Since DF has it backwards right now you'd need to invert the logic. > The best solution I can think of would be for to_timestamp(...) to support a second, optional argument where the timezone can be specified. As long as it is optional that is probably a good idea. However, that sounds like it is adding capability and wouldn't fix anything on its own. > There is also the possibility to simplify things and eliminate the timezone field from Arrow, but that's a really big change. For example, Parquet does not store the timezone but simply has a isAdjustedToUTC field. If True, that means timestamps are UTC, and if not that means local timestamp. I'm not sure exactly what you are saying here. `isAdjustedToUTC` basically boils down to `timezone != null` (I might have that backwards). If you are arguing to avoid using timestamps other than `UTC` or `null` then I agree that is a viable option. Implementations can always store timestamp info in a separate column or always represent "zoned date times" as structs (given that the only reason to use "zoned date time" in the first place is typically for field extraction). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org