I've posted the examples above in https://docs.google.com/document/d/1QDwX4ypfNvESc2ywcT1ygaf2Y1R8SmkpifMV7gpJdBI/edit?usp=sharing because I think it would be better to collaborate there instead of in a linear e-mail history, and then bring the consensus back to the list.
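In the meantime, here is a minimal pyarrow sketch of the hour-extraction example Micah walks through below. This is just my own illustration, not anything normative: I'm assuming millisecond precision, a recent pyarrow, and a fixed "-08:00" offset standing in for PST; the expected values in the comments follow the semantics described below.

    import pyarrow as pa
    import pyarrow.compute as pc

    # (1970, 1, 2, 14, 0) as milliseconds since the Unix epoch.
    naive = pa.array([136800000], type=pa.timestamp("ms"))   # no timezone
    utc = naive.cast(pa.timestamp("ms", tz="UTC"))           # same bits, tagged UTC
    pst = naive.cast(pa.timestamp("ms", tz="-08:00"))        # same bits, tagged -08:00

    print(pc.hour(naive))  # 14 -- extraction on the stored value itself
    print(pc.hour(utc))    # 14 -- the instant viewed in UTC
    print(pc.hour(pst))    # 6  -- the same instant viewed at -08:00

Note that the three arrays share the same underlying int64 values; only the type metadata differs.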
On Thu, Jun 17, 2021 at 2:56 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> I feel like we might still be talking past each other here, or at least I don't understand the two sides of this. I'll try to expand Weston's example because I think it provides the best clarification.
>
> (1970, 1, 2, 14, 0) is stored as 0x0000000008276700 (136800000, assuming ms) for a timestamp column without a timezone (always). This represents an offset from the Unix epoch. This interpretation should not change based on the local system timezone. Extracting the hour field always yields 14 (extraction is done relative to UTC).
>
> The alternative here seems to be that we can encode (1970, 1, 2, 14, 0) in multiple different ways depending on what the current local system time is. As a note, I think ORC and Spark do this, and it leads to confusion/misinterpretation when trying to transfer data.
>
> If we then convert this column to a timestamp with the timezone "UTC", extracting the hour field still yields 14. If the column is converted to a timestamp with the timezone PST, extracting the hour would yield 6 (assuming PST = GMT-8). Through all of these changes the data bits do not change.
>
> Display is not mentioned because I think the points made about how a time is displayed are correct: applications can choose what they feel makes sense to them (as long as they don't start automatically tacking on timezones to naive timestamps). My interpretation of the specification has been that "display" was a kind of shorthand for field extraction.
>
> Could others on the thread confirm this is the issue up for debate? Are there subtleties/operations we need to consider?
>
> I also agree that we should document recommended conversion practices from other systems.
>
> -Micah
>
>> So let's invent a third way. I could use the first 16 bits for the year, the next 8 bits for the month, the next 8 bits for the day of month, the next 8 bits for the hour, the next 8 bits for the minute, and the remaining bits for the seconds. Using this method I would store (1970, 1, 2, 14, 0) as 0x07B201020E000000.
>
> Aside: with some small variation, this is what ZetaSQL uses [2].
>
> [1] https://arrow.apache.org/docs/python/timestamps.html#pandas-arrow-spark
> [2] https://github.com/google/zetasql/blob/master/zetasql/public/civil_time.h#L62
>
> On Thu, Jun 17, 2021 at 1:58 PM Adam Hooper <a...@adamhooper.com> wrote:
>
>> On Thu, Jun 17, 2021 at 2:59 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> > The SQL standard (e.g. PostgreSQL) has two timestamp types: with/without time zone — in some SQL implementations each slot can have a different time zone. https://www.postgresql.org/docs/9.1/datatype-datetime.html
>> > WITHOUT TIME ZONE: "timestamp without time zone value should be taken or given as timezone local time"
>>
>> RDBMSs conflict (universally) with ANSI.
>>
>> PostgreSQL TIMESTAMP WITH TIME ZONE is a 64-bit int Instant since the epoch. It has no timezone.
>>
>> MySQL/MariaDB/BigTable/[your fork here] TIMESTAMP is also an int Instant since the epoch. It has no timezone.
>>
>> TIMESTAMP *WITHOUT* TIME ZONE is indeed akin to NumPy "naive datetime" in *function*, but not in implementation:
>>
>> - MySQL DATETIME <https://dev.mysql.com/doc/internals/en/date-and-time-data-type-representation.html> is weird: 1-bit sign, a 17-bit year-and-month field, 5-bit day, ....
>> - MSSQL <https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15> uses 6, 7 or 8 bytes
>> - PostgreSQL stores an integer, but I think its epoch is still different <https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/include/datatype/timestamp.h> (it used to store doubles since 2000-01-01)
>>
>> ... so in general, moving datetimes from these systems into 64-bit integers is nontrivial and lossy.
>>
>> > Spark / Databricks discusses how Spark handles this
>> > https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps
>> > * WITHOUT TIME ZONE: "These timestamps are not bound to any time zone, and are wall clock timestamps." — not UTC-normalized
>> > * WITH TIME ZONE: "does not affect the physical point in time that the timestamp represents, as that is fully represented by the UTC time instant given by the other timestamp components"
>>
>> I don't use Spark, but I read that page twice. First reading, I got the same thing out of it. But the second time I read it, I read the opposite!
>>
>> The key part is: "*Spark SQL defines the timestamp type as TIMESTAMP WITH SESSION TIME ZONE*" -- in other words, Spark doesn't have a TIMESTAMP WITH TIME ZONE *or* TIMESTAMP WITHOUT TIME ZONE. It has just one Timestamp type: a 64-bit Instant since the epoch. (It also has a Date type.)
>>
>> If I'm reading correctly, this is exactly the same as PostgreSQL TIMESTAMP WITH TIME ZONE and MySQL TIMESTAMP: servers always store and transmit timestamps as bare 64-bit integers since the epoch -- without timezone.
>>
>> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be interpreted as UTC-normalized, that would force all of these other systems (and more) to serialize their data to be UTC-normalized
>>
>> Are those systems' 64-bit integers interoperable in the first place?
>>
>> As I understand it, there's a ton of variance out there when encoding datetimes as 64-bit integers. The NumPy+Arrow+Parquet+Postgres encoding is one of many. As I mentioned in another thread, programming languages all use structs.
>>
>> > pandas as discussed has non-UTC-normalized WITHOUT TIME ZONE "naive" timestamps and UTC-normalized WITH TIME ZONE.
>> >
>> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be interpreted as UTC-normalized, that would force all of these other systems (and more) to serialize their data to be UTC-normalized (i.e. calling the equivalent of pandas's tz_localize function) when they convert to Arrow.
>>
>> Alternatives:
>>
>> - int64
>> - date32+time64
>> - date32+time32
>>
>> > This seems very harmful to me, and will make data from these systems not accurately representable in Arrow and unable to be round-tripped.
>>
>> Certainly nobody wants to go backwards.
>>
>> We need to clarify: how do we store these *common* types -- MySQL TIMESTAMP, Spark TIMESTAMP, PostgreSQL TIMESTAMP WITH TIME ZONE -- in Arrow?
>>
>> Secondarily, I think: how do we recommend users store *datetimes* in Arrow? (I'd expect this to be messier, since every system/language uses a different byte structure.)
>>
>> > Perhaps we can make a spreadsheet and look comprehensively at how many use cases would be disenfranchised by requiring UTC normalization always.
>>
>> Hear, hear!
>>
>> Can we also poll people to find out how they're storing Instants today?
>>
>> Enjoy life,
>> Adam
>>
>> --
>> Adam Hooper
>> +1-514-882-9694
>> http://adamhooper.com
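P.S. For anyone skimming: here is a tiny, purely illustrative Python sketch of the packed civil-time encoding Weston describes above, next to the epoch-based encoding. The field widths are taken from his example (16-bit year, 8-bit month/day/hour/minute, remaining 16 bits for seconds); this is not any system's actual layout.

    def pack_civil(year, month, day, hour, minute, second=0):
        # 16-bit year, then 8-bit month, day, hour, minute, and 16 bits of seconds.
        return (year << 48) | (month << 40) | (day << 32) | (hour << 24) | (minute << 16) | second

    print(hex(pack_civil(1970, 1, 2, 14, 0)))  # 0x7b201020e000000, i.e. 0x07B201020E000000

    # The same wall-clock time as milliseconds since the Unix epoch:
    print((24 + 14) * 60 * 60 * 1000)          # 136800000 -- a completely different bit pattern

Which is really the interoperability point: a bare int64 column is not self-describing, so we need the spec (and conversion guidance) to pin down which of these encodings and semantics Arrow means.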