On Thu, Jun 17, 2021 at 2:59 PM Wes McKinney <wesmck...@gmail.com> wrote:
> The SQL standard (e.g. PostgreSQL) has two timestamp types:
> with/without time zone — in some SQL implementations each slot can
> have a different time zone
> https://www.postgresql.org/docs/9.1/datatype-datetime.html
> WITHOUT TIME ZONE: "timestamp without time zone value should be taken
> or given as timezone local time"

RDBMSs conflict (universally) with ANSI. PostgreSQL TIMESTAMP WITH TIME
ZONE is a 64-bit int Instant since the epoch. It has no timezone.
MySQL/MariaDB/BigTable/[your fork here] TIMESTAMP is also an int
Instant since the epoch. It has no timezone.

TIMESTAMP *WITHOUT* TIME ZONE is indeed akin to NumPy "naive datetime"
in *function*, but not in implementation:

- MySQL DATETIME
  <https://dev.mysql.com/doc/internals/en/date-and-time-data-type-representation.html>
  is weird: 1-bit sign, 17-bit year*13+month, 5-bit day, ...
- MSSQL
  <https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15>
  uses 6, 7 or 8 bytes, depending on precision
- PostgreSQL stores an integer, but its epoch is different -- 2000-01-01
  <https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/include/datatype/timestamp.h>
  (it used to store doubles)

... so in general, moving datetimes from these systems into 64-bit
integers is nontrivial and lossy.

> Spark / Databricks discusses how Spark handles this
>
> https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps
>
> * WITHOUT TIME ZONE: "These timestamps are not bound to any time zone,
> and are wall clock timestamps." — not UTC-normalized
> * WITH TIME ZONE: "does not affect the physical point in time that the
> timestamp represents, as that is fully represented by the UTC time
> instant given by the other timestamp components"

I don't use Spark, but I read that page twice. On the first reading I
got the same thing out of it as you did; on the second, I read the
opposite! The key part is: "*Spark SQL defines the timestamp type as
TIMESTAMP WITH SESSION TIME ZONE*" -- in other words, Spark has neither
TIMESTAMP WITH TIME ZONE *nor* TIMESTAMP WITHOUT TIME ZONE. It has just
one Timestamp type: a 64-bit Instant since the epoch. (It also has a
Date type.)

If I'm reading correctly, this is exactly the same as PostgreSQL
TIMESTAMP WITH TIME ZONE and MySQL TIMESTAMP: those servers always
store and transmit timestamps as bare 64-bit integers since the epoch
-- without timezone.

> If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> interpreted as UTC-normalized, that would force all of these other
> systems (and more) to serialize their data to be UTC-normalized

Are those systems' 64-bit integers interoperable in the first place? As
I understand it, there's a ton of variance out there when encoding
datetimes as 64-bit integers. The NumPy+Arrow+Parquet+Postgres encoding
is one of many. And as I mentioned in another thread, programming
languages don't even use integers for naive datetimes: they all use
structs.

> pandas as discussed has non-UTC-normalized WITHOUT TIME ZONE "naive"
> timestamps and UTC-normalized WITH TIME ZONE.
>
> If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> interpreted as UTC-normalized, that would force all of these other
> systems (and more) to serialize their data to be UTC-normalized (i.e.
> calling the equivalent of pandas's tz_localize function) when they
> convert to Arrow.
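Concretely, that tz_localize step doesn't convert information -- it
invents it, by *assuming* a zone. A quick pandas sketch (made-up
values, but the API is real):

    import pandas as pd

    # A "naive" (WITHOUT TIME ZONE) timestamp: no instant is implied.
    naive = pd.Timestamp("2021-06-17 14:59:00")

    # tz_localize() picks an instant by assuming a zone. Two different
    # assumptions give two different 64-bit values on the wire:
    as_utc = naive.tz_localize("UTC")
    as_ny = naive.tz_localize("America/New_York")

    print(as_utc.value)  # nanoseconds since the epoch
    print(as_ny.value)   # a different integer, same wall-clock time

Two assumptions, two different integers: that's the round-trip hazard.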
Alternatives:

- int64
- date32+time64 (sketched in the P.S. below)
- date32+time32

> This seems very harmful to me, and will make data from these systems
> not accurately representable in Arrow and unable to be round-tripped.

Certainly nobody wants to go backwards. We need to clarify: how do we
store these *common* types -- MySQL TIMESTAMP, Spark TIMESTAMP,
PostgreSQL TIMESTAMP WITH TIME ZONE -- in Arrow?

Secondarily, I think: how do we recommend users store *datetimes* in
Arrow? (I'd expect this to be messier, since every system/language uses
a different byte structure.)

> Perhaps we can make a spreadsheet and look comprehensively at how many
> use cases would be disenfranchised by requiring UTC normalization
> always.

Hear, hear! Can we also poll people to find out how they're storing
Instants today?

Enjoy life,
Adam

--
Adam Hooper
+1-514-882-9694
http://adamhooper.com
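P.S. To make the date32+time64 alternative concrete, here's a sketch in
pyarrow (the column names are mine; an illustration, not a spec
proposal):

    import datetime
    import pyarrow as pa

    # One naive wall-clock datetime, 2021-06-17 14:59:00, stored without
    # inventing an instant: the date and the time-of-day travel as
    # separate columns.
    table = pa.table({
        "event_date": pa.array([datetime.date(2021, 6, 17)]),              # date32[day]
        "event_time": pa.array([datetime.time(14, 59)], pa.time64("us")),  # time64[us]
    })
    print(table.schema)

It round-trips exactly -- at the cost of two columns instead of one.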