gaogaotiantian commented on PR #53161: URL: https://github.com/apache/spark/pull/53161#issuecomment-3588165456
Yeah, this could be a breaking change, but it is the correct way to go. Mapping `TimestampType` to a naive datetime object is technically not "safer" - a naive value still can't be compared with an aware timestamp. It's not like naive timestamps have better compatibility; you have to choose one or the other. I don't have the best knowledge of pandas, but it seems like they have similar concerns - https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html

We can't really make it work properly if we mix them up. I can think of a few ways to make it less painful:

1. If the user uses a naive datetime and tries to convert it to a `TimestampType` explicitly, treat the naive timestamp as `utc` instead of raising an error (configurable) - see the sketch below.
2. When we infer types, infer based on whether the datetime has a timezone - do not automatically map to `TimestampType`.
3. Provide a flag to keep the original behavior - name it something like `keep_the_wrong_timestamp_behavior`. If users are not ready, they need to explicitly set that flag.
4. Generate warnings when users try to mix these things up.

I agree this could be disruptive, but that's the problem: it's a whole big mess internally and we simply can't make it better while keeping backward compatibility.
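
For what it's worth, here is a minimal Python sketch of the comparison problem and of option 1 (treating a naive value as UTC). The `attach_utc` helper is purely illustrative, not an existing Spark or pandas API:

```python
from datetime import datetime, timezone

naive = datetime(2024, 1, 1, 12, 0, 0)                       # no tzinfo
aware = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)  # tz-aware

try:
    naive < aware
except TypeError as e:
    # "can't compare offset-naive and offset-aware datetimes"
    print(e)

# Option 1 above: interpret the naive value as UTC instead of raising.
# attach_utc is a hypothetical helper used only for this illustration.
def attach_utc(dt: datetime) -> datetime:
    return dt if dt.tzinfo is not None else dt.replace(tzinfo=timezone.utc)

print(attach_utc(naive) == aware)  # True
```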
