gaogaotiantian commented on PR #53161:
URL: https://github.com/apache/spark/pull/53161#issuecomment-3588165456

   Yeah, this could be a breaking change, but this is the correct way to go. 
Mapping `TimestampType` to a naive datetime object is technically not "safer" - 
it still can't be compared with an aware timestamp. It's not like a naive 
timestamp has better compatibility - you have to choose one or the other.
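
   To illustrate the comparison problem, here is a minimal sketch using only 
the standard library `datetime` module (not Spark itself). The variable names 
are just for illustration:

```python
from datetime import datetime, timezone

# Same wall-clock time, one without and one with timezone info.
naive = datetime(2024, 1, 1, 12, 0, 0)
aware = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)

# Ordering comparisons between naive and aware datetimes raise TypeError.
try:
    naive < aware
    comparable = True
except TypeError:
    comparable = False

# Equality does not raise, but a naive and an aware datetime never compare equal.
equal = naive == aware

print(comparable, equal)
```

   So whichever one `TimestampType` maps to, values of the other kind still 
need an explicit conversion before they can be compared.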
   
   I don't have the best knowledge of pandas, but it seems like they have 
similar concerns - 
https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html
   
   I mean we can't really make it work properly if we mix them up. I can think 
of a few ways to make it less painful:
   1. If the user uses a naive datetime and tries to convert it to a 
`TimestampType` explicitly, we treat the naive timestamp as UTC instead of 
raising an error (configurable).
   2. When we infer types, we infer based on whether the datetime has a 
timezone - we do not automatically map it to `TimestampType`.
   3. Provide a flag to keep the original behavior - name it something like 
`keep_the_wrong_timestamp_behavior`. If users are not ready, they need to 
explicitly set that flag.
   4. Generate warnings when users try to mix these things up.
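
   A rough sketch of how options 1, 2 and 4 could fit together, in plain 
Python. The helper names (`infer_timestamp_type`, `to_aware`) and the 
`assume_utc` parameter are hypothetical, not existing PySpark APIs, and mapping 
naive values to `TimestampNTZType` is my assumption about what inference would 
pick:

```python
import warnings
from datetime import datetime, timezone


def is_aware(value: datetime) -> bool:
    # Standard-library definition of an "aware" datetime.
    return value.tzinfo is not None and value.utcoffset() is not None


def infer_timestamp_type(value: datetime) -> str:
    # Option 2 (hypothetical helper): infer from tz-awareness instead of
    # always choosing TimestampType. Returns a type name for illustration.
    return "TimestampType" if is_aware(value) else "TimestampNTZType"


def to_aware(value: datetime, assume_utc: bool = True) -> datetime:
    # Option 1 (hypothetical helper): explicit conversion to an aware value.
    if is_aware(value):
        return value
    if not assume_utc:
        raise ValueError("naive datetime cannot be converted to TimestampType")
    # Option 4: warn when naive and aware values are being mixed.
    warnings.warn("naive datetime treated as UTC", UserWarning)
    return value.replace(tzinfo=timezone.utc)
```

   With `assume_utc=True`, `to_aware(datetime(2024, 1, 1))` attaches UTC and 
emits a warning; with `assume_utc=False` it raises instead.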
   
   I agree this could be disruptive, but we can't make it right - that's the 
problem. It's a whole big mess internally and we simply can't make it better 
while keeping backward compatibility.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

