Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/18664

Regarding Wes's concern: I think we are only dealing with values in UTC here; both Spark and Arrow internally represent timestamps as microseconds since the epoch.

On the two issues Bryan and Ueshin brought up:

Issue 1: I agree with Ueshin that we should stick to `SESSION_LOCAL_TIMEZONE`. Bryan raised a good point that in PySpark, `df.toPandas()`, `df.collect()`, and Python UDFs (through `Timestamp.fromInternal`) don't respect `SESSION_LOCAL_TIMEZONE`, which is confusing and inconsistent with Spark SQL behavior such as `df.show()` (see the first sketch below). Since the result is going to be inconsistent either with Spark SQL (`df.show()`) or with PySpark (i.e., the default `df.toPandas()`), I'd rather we do the right thing here by using `SESSION_LOCAL_TIMEZONE` and fix the other PySpark behavior separately.

Issue 2: I agree with Bryan that we should leave the timezone as is. I don't think there is a performance issue because, as Wes mentioned, it's just a metadata operation (see the second sketch below). Converting back to the system timezone defeats the purpose of using the session timezone, and throwing away the tzinfo seems unnecessary.
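For reference, a minimal sketch of the Issue 1 inconsistency (my own illustration, not code from this PR). It assumes a local PySpark session whose driver system timezone differs from the session timezone set below; `spark.sql.session.timeZone` is the configuration key behind `SESSION_LOCAL_TIMEZONE`:

```python
# Rough sketch of the df.show() vs. df.toPandas()/df.collect() inconsistency.
from datetime import datetime

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[1]")
    .config("spark.sql.session.timeZone", "America/Los_Angeles")
    .getOrCreate()
)

df = spark.createDataFrame([(datetime(2017, 7, 17, 12, 0, 0),)], ["ts"])

# Spark SQL rendering respects the session timezone.
df.show(truncate=False)

# The default toPandas()/collect() path goes through the driver's system
# timezone (e.g. via Timestamp.fromInternal), so the wall-clock values it
# returns can differ from what df.show() displays when the session timezone
# is not the system timezone.
print(df.toPandas()["ts"].iloc[0])
print(df.collect()[0]["ts"])
```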
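And a small pandas sketch of the Issue 2 point that converting between timezones on tz-aware values is metadata-only (example values are mine):

```python
import pandas as pd

# Two tz-aware series that represent the same instants but carry different
# timezone metadata.
utc = pd.Series(pd.to_datetime(["2017-07-17 12:00:00"])).dt.tz_localize("UTC")
local = utc.dt.tz_convert("America/Los_Angeles")

# The underlying epoch values are identical: tz_convert rewrites only the tz
# metadata, so leaving the result in the session timezone costs essentially
# nothing, and converting it back to the system timezone buys nothing.
assert (utc.values.astype("int64") == local.values.astype("int64")).all()
print(local.iloc[0])  # same instant, rendered in America/Los_Angeles
```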