gaogaotiantian commented on PR #52980: URL: https://github.com/apache/spark/pull/52980#issuecomment-3528990850
So here's my thought on this change. I think this is a bug fix, because converting timestamps based on whichever cluster machine a task lands on does not even make sense. Tasks can be distributed to machines with different local timezones, so there's no way users can rely on the "existing behavior". That behavior gives inconsistent results for the same data and the same UDF, purely because the tasks ran on different machines - that's clearly a bug.

We should at least make it possible for all workers to use the same timezone - right now there's no way to do that. With this change, all workers will respect `spark.sql.session.timeZone`, which is a great improvement. But that's not the end of it: without that config, the workers still fall back to their local timezones. I think an acceptable fallback is to use the timezone of the *driver*, which at least gives a consistent result.

As for the timestamp type itself, Python advises [against](https://docs.python.org/3/library/datetime.html#datetime.datetime.utcfromtimestamp) naive timestamps (timestamps without a timezone). I think we should give users a warning when they use a naive timestamp and encourage them to use either UTC or an aware timestamp. I also root for having a timestamp type that supports timezones.
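To illustrate the pattern being encouraged, here is a minimal sketch. The `spark.sql.session.timeZone` config key and the `datetime` calls are real; the UDF, column names, and sample data are made up for this example, not taken from the PR:

```python
# Minimal sketch: pin the session timezone and return aware timestamps
# from a Python UDF. Assumes a local PySpark installation; the UDF and
# column names below are illustrative.
from datetime import datetime, timezone

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType

spark = (
    SparkSession.builder
    # Pin the session timezone so every worker converts timestamps the
    # same way, instead of falling back to each machine's local zone.
    .config("spark.sql.session.timeZone", "UTC")
    .getOrCreate()
)

# Prefer aware datetimes: fromtimestamp(..., tz=timezone.utc) carries an
# explicit UTC offset, unlike the deprecated datetime.utcfromtimestamp().
@udf(returnType=TimestampType())
def to_event_time(epoch_seconds):
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

df = spark.createDataFrame([(1700000000,)], ["epoch"])
df.select(to_event_time("epoch").alias("event_time")).show(truncate=False)
```

With the session timezone pinned, the displayed `event_time` is the same no matter which worker or driver machine runs the job; drop the config line and the output depends on each machine's local timezone, which is exactly the inconsistency described above.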
