gaogaotiantian commented on PR #52980: URL: https://github.com/apache/spark/pull/52980#issuecomment-3530875432
> That being said, we should never rely on the local machine timezone. We should either respect the session timezone (specified by `spark.sql.session.timeZone`, which has a default value if not set), or the Python objects should be timezone agnostic.

I totally agree with this; that's exactly the point I'm trying to make. The local machine timezone should never affect the result of user code.

Let's talk about timestamps. In Python it's discouraged to have a real `datetime` object without a timezone, because any operation will treat it as being in the local timezone. I believe the actual internal storage uses an integer timestamp. When Python converts an integer timestamp to a `datetime` via `datetime.datetime.fromtimestamp`, it assumes the integer is a POSIX timestamp, so it needs a timezone to produce a `datetime`. There is no truly "timezone-agnostic datetime" in Python: Python will assume a `datetime` without a timezone is in the local machine's timezone.

That being said, I think using UTC for TimestampNTZ is the correct implementation, because Python will then treat the integer timestamp as UTC, which should give the correct result. However, for timestamps with a timezone, we should use either the session config or at least a value that is consistent across all executors (the driver timezone would be a good candidate; UTC is another option).

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
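The `fromtimestamp` behavior discussed above can be sketched with a minimal stdlib-only example (the integer timestamp value is arbitrary and not taken from Spark internals, and the UTC+8 offset is just a hypothetical stand-in for a session timezone):

```python
import datetime

ts = 1_700_000_000  # an arbitrary integer POSIX timestamp

# Naive conversion: the wall-clock fields depend on the local machine
# timezone, so two executors in different timezones would disagree.
naive = datetime.datetime.fromtimestamp(ts)
assert naive.tzinfo is None  # "timezone agnostic" only in appearance

# Explicit UTC conversion: identical on every machine, which is why
# interpreting TimestampNTZ values as UTC gives a consistent result.
utc = datetime.datetime.fromtimestamp(ts, tz=datetime.timezone.utc)

# Explicit fixed offset standing in for a session timezone (hypothetical UTC+8):
session_tz = datetime.timezone(datetime.timedelta(hours=8))
session = datetime.datetime.fromtimestamp(ts, tz=session_tz)

# Both aware datetimes denote the same instant, just rendered differently.
assert utc == session
print(utc.isoformat())      # 2023-11-14T22:13:20+00:00
print(session.isoformat())  # 2023-11-15T06:13:20+08:00
```

The naive result varies from machine to machine; the two aware results are reproducible everywhere, differing only in how the instant is displayed.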
