gaogaotiantian commented on PR #52980: URL: https://github.com/apache/spark/pull/52980#issuecomment-3528990850
So here's my thought on this change. I think this is a bug fix, because converting timestamps based on whichever cluster machine a task lands on does not even make sense. Tasks can be distributed to machines with different local timezones, so there's no way users can rely on the "existing behavior". That behavior gives inconsistent results for the same data and the same UDF, purely because the tasks ran on different machines - that's clearly a bug.

We should at least make it possible for all workers to use the same timezone - right now there's no way to do that. With this change, all workers will respect `spark.sql.session.timeZone`, which is a great improvement. But that's not the end of it: without that config, the workers still fall back to their local timezones. I think an acceptable fallback is to use the timezone of the *driver*, which at least gives a consistent result.

As for the timestamp type itself, Python advises [against](https://docs.python.org/3/library/datetime.html#datetime.datetime.utcfromtimestamp) naive timestamps (timestamps without a timezone). I think we should give users a warning when they use a naive timestamp and encourage them to use either UTC or an aware timestamp. I also root for having a timestamp type that supports timezones.
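To illustrate the pattern being encouraged, here is a minimal sketch. The `spark.sql.session.timeZone` config key and the `datetime` calls are real; the UDF, column names, and sample data are made up for this example, not taken from the PR:

```python
# Minimal sketch: pin the session timezone and return aware timestamps
# from a Python UDF. Assumes a local PySpark installation; the UDF and
# column names below are illustrative.
from datetime import datetime, timezone

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType

spark = (
    SparkSession.builder
    # Pin the session timezone so every worker converts timestamps the
    # same way, instead of falling back to each machine's local zone.
    .config("spark.sql.session.timeZone", "UTC")
    .getOrCreate()
)

# Prefer aware datetimes: fromtimestamp(..., tz=timezone.utc) carries an
# explicit UTC offset, unlike the deprecated datetime.utcfromtimestamp().
@udf(returnType=TimestampType())
def to_event_time(epoch_seconds):
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

df = spark.createDataFrame([(1700000000,)], ["epoch"])
df.select(to_event_time("epoch").alias("event_time")).show(truncate=False)
```

With the session timezone pinned, the displayed `event_time` is the same no matter which worker or driver machine runs the job; drop the config line and the output depends on each machine's local timezone, which is exactly the inconsistency described above.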
