cloud-fan commented on PR #53161: URL: https://github.com/apache/spark/pull/53161#issuecomment-3581415846
@gaogaotiantian The key point about Spark's `TimestampType` is that it represents an absolute time. The session timezone only matters when we render the timestamp without a timezone (e.g. `df.show`, casting to string, or functions that extract the year/month/.../second fields from a timestamp).

For `df = spark.createDataFrame([(datetime.datetime(1990, 8, 10, 0, 0),)], ["ts"])`, we use a specific session `spark` to create the dataframe, so we should respect its session timezone: we convert `datetime.datetime(1990, 8, 10, 0, 0)` to an absolute timestamp by attaching the session timezone to it. Moreover, we can have a mix of Python `datetime.datetime` objects with different timezones or no timezone at all, and that's fine because we can still convert all of them to absolute timestamps.

A similar example is reading a JDBC table that contains a column of the standard TIMESTAMP WITH TIME ZONE type. Each value can carry a different timezone, but it's still fine to read the column as Spark `TimestampType`, because every value can be converted to an absolute timestamp.

Under the hood, `TimestampType` is stored as an int64 in memory: the number of microseconds since the UTC epoch (`1970-01-01 00:00:00 Z`).
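A minimal sketch of this "absolute timestamp" model in plain Python (the helper name `to_timestamp_micros` and the UTC+8 session timezone are made up for illustration; Spark's actual conversion happens in its internals, this just shows the idea that naive values get the session timezone attached and everything collapses to microseconds since the UTC epoch):

```python
from datetime import datetime, timedelta, timezone

UTC_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_timestamp_micros(dt: datetime, session_tz: timezone) -> int:
    """Convert a datetime to microseconds since the UTC epoch.

    Naive datetimes are interpreted in the session timezone,
    mirroring how a session timezone would be attached before
    storing the value as an absolute instant.
    """
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=session_tz)
    return (dt - UTC_EPOCH) // timedelta(microseconds=1)

# Hypothetical session timezone of UTC+8.
session_tz = timezone(timedelta(hours=8))

# A naive value interpreted in the session timezone...
naive = to_timestamp_micros(datetime(1990, 8, 10, 0, 0), session_tz)
# ...and an already-aware value denoting the same instant in UTC.
aware = to_timestamp_micros(
    datetime(1990, 8, 9, 16, 0, tzinfo=timezone.utc), session_tz
)

# A mix of naive and differently-zoned datetimes is fine: once
# converted, both are the same absolute microsecond count.
assert naive == aware
```

The point of the sketch is that after conversion the session timezone is gone from the stored value; it only reappears when the int64 is rendered back into a human-readable form.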
