gaogaotiantian commented on PR #52980:
URL: https://github.com/apache/spark/pull/52980#issuecomment-3528990850

   So here's my thought for this change. I think this is a bug fix, because 
converting timestamps based on the cluster machine's local timezone does not 
even make sense. Tasks can be distributed to machines with different local 
timezones, so there is no way users can rely on the "existing behavior". The 
existing behavior gives inconsistent results for the same data and the same 
UDF, just because the tasks landed on different machines - that's clearly a 
bug. A minimal sketch of the inconsistency is below.
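
   For illustration only (plain Python, hypothetical epoch value): the same 
epoch second turns into different naive datetimes depending on the worker 
machine's timezone.

   ```python
   from datetime import datetime
   from zoneinfo import ZoneInfo

   ts = 1700000000  # the same epoch second arrives on every worker

   # A worker whose local timezone is UTC vs. one set to Los Angeles time:
   on_utc_worker = datetime.fromtimestamp(ts, tz=ZoneInfo("UTC")).replace(tzinfo=None)
   on_la_worker = datetime.fromtimestamp(ts, tz=ZoneInfo("America/Los_Angeles")).replace(tzinfo=None)

   print(on_utc_worker)  # 2023-11-14 22:13:20
   print(on_la_worker)   # 2023-11-14 14:13:20  <- same data, same udf, different result
   ```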
   
   We should at least try to make all workers use the same timezone - right now 
there's no way to do that. With this change, all the workers will respect 
`spark.sql.session.timeZone`, which is a great improvement. But that's not the 
end of it, because without that config the workers will still fall back to 
their local timezones. I think an acceptable fallback behavior is to use the 
*driver's* timezone - that would at least give a consistent result.
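
   To make that concrete, here's a sketch of pinning the session timezone with 
the existing config key (the config key is real Spark SQL configuration; the 
rest is illustrative):

   ```python
   from pyspark.sql import SparkSession

   # Pin the SQL session timezone so every worker converts timestamps the same
   # way, instead of each worker falling back to its machine's local timezone.
   spark = (
       SparkSession.builder
       .config("spark.sql.session.timeZone", "UTC")
       .getOrCreate()
   )
   ```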
   
   As for the timestamp type itself, Python advises 
[against](https://docs.python.org/3/library/datetime.html#datetime.datetime.utcfromtimestamp)
 naive timestamps (timestamps without timezone info). I think we should give 
users a warning when they use a naive timestamp and encourage them to use 
either UTC or an aware timestamp.
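
   Roughly the pattern I have in mind (the deprecation of `utcfromtimestamp` 
is documented at the link above; the snippet is just a sketch):

   ```python
   from datetime import datetime, timezone

   ts = 1700000000

   # Deprecated since Python 3.12, and returns a *naive* datetime:
   #   naive = datetime.utcfromtimestamp(ts)

   # Preferred: an *aware* datetime explicitly pinned to UTC:
   aware = datetime.fromtimestamp(ts, tz=timezone.utc)
   print(aware)  # 2023-11-14 22:13:20+00:00
   ```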
   
   I also root for having a timestamp type that supports timezones.



