I've posted the examples above in
https://docs.google.com/document/d/1QDwX4ypfNvESc2ywcT1ygaf2Y1R8SmkpifMV7gpJdBI/edit?usp=sharing
because I think it would be better to collaborate there instead of in a
linear e-mail history, and then bring the consensus back to the list.

On Thu, Jun 17, 2021 at 2:56 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> I feel like we might still be talking past each other here or at least I
> don't understand the two sides of this.  I'll try to expand Weston's
> example because I think it provides the best clarification.
>
> (1970, 1, 2, 14, 0) is stored as 0x0000000008276700 (136800000, assuming
> ms) for a timestamp column without timezone (always).  This represents an
> offset from the Unix epoch.  This interpretation should not change based on
> the local system timezone.  Extracting the hour field always yields 14
> (extraction is done relative to UTC).
>
> The alternative here seems to be that we can encode (1970, 1, 2, 14, 0) in
> multiple different ways depending on what the current local system timezone
> is.  As a note, I think ORC and Spark do this, and it leads to
> confusion/misinterpretation when trying to transfer data.
>
> If we then convert this column to a timestamp with timezone "UTC",
> extracting the hour field still yields 14.  If the column is converted to a
> timestamp with timezone PST, extracting the hour would yield 6 (assuming
> PST = GMT-8).  Through all of these changes the data bits do not change.
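>
> For concreteness, here is a minimal sketch of those semantics in plain
> Python (my own illustration, not Arrow API; it assumes millisecond
> resolution and the helper names are made up):
>
>     import datetime
>     from zoneinfo import ZoneInfo  # Python 3.9+, needs a tz database
>
>     EPOCH = datetime.datetime(1970, 1, 1)
>
>     def encode_naive_ms(y, mo, d, h, mi):
>         # Encode the wall-clock tuple as ms since the epoch; the local
>         # system timezone plays no part in this.
>         delta = datetime.datetime(y, mo, d, h, mi) - EPOCH
>         return int(delta.total_seconds() * 1000)
>
>     def extract_hour(ms):
>         # Field extraction on a naive column is done relative to UTC.
>         return (EPOCH + datetime.timedelta(milliseconds=ms)).hour
>
>     ms = encode_naive_ms(1970, 1, 2, 14, 0)  # 136800000
>     assert extract_hour(ms) == 14            # naive column (and tz="UTC")
>
>     # Reinterpreting the same bits with timezone metadata only changes
>     # extraction; the stored value `ms` itself never changes.
>     instant = (datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)
>                + datetime.timedelta(milliseconds=ms))
>     assert instant.astimezone(ZoneInfo("America/Los_Angeles")).hour == 6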
>
> Display is not mentioned because I think the points already made about how
> a time is displayed are correct: applications can choose whatever they feel
> makes sense to them (as long as they don't start automatically tacking
> timezones onto naive timestamps).  My interpretation of the specification
> has been that display was kind of a shorthand for field extraction.
>
> Could others on the thread confirm this is the issue up for debate?  Are
> there subtleties/operations we need to consider?
>
> I also agree that we should document recommended conversion practices from
> other systems.
>
> -Micah
>
>
>> So let's invent a third way.  I could use
>> the first 16 bits for the year, the next 8 bits for the month, the
>> next 8 bits for the day of month, the next 8 bits for the hour, the
>> next 8 bits for the minute, and the remaining bits for the seconds.
>> Using this method I would store (1970, 1, 2, 14, 0) as
>> 0x07B201020E000000.
>
> As an aside, with some small variation this is what ZetaSQL uses [2].
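>
> For reference, a quick Python sketch of that packing (purely illustrative,
> not anything defined in the Arrow spec or in ZetaSQL):
>
>     def pack_civil(year, month, day, hour, minute, second=0):
>         # 16-bit year | 8-bit month | 8-bit day | 8-bit hour |
>         # 8-bit minute | remaining 16 bits for the seconds
>         return ((year << 48) | (month << 40) | (day << 32)
>                 | (hour << 24) | (minute << 16) | second)
>
>     assert pack_civil(1970, 1, 2, 14, 0) == 0x07B201020E000000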
>
> [1]
> https://arrow.apache.org/docs/python/timestamps.html#pandas-arrow-spark
> [2]
> https://github.com/google/zetasql/blob/master/zetasql/public/civil_time.h#L62
>
>
>
> On Thu, Jun 17, 2021 at 1:58 PM Adam Hooper <a...@adamhooper.com> wrote:
>
>> On Thu, Jun 17, 2021 at 2:59 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> >
>> > The SQL standard (e.g. PostgreSQL) has two timestamp types:
>> > with/without time zone — in some SQL implementations each slot can
>> > have a different time zone
>> > https://www.postgresql.org/docs/9.1/datatype-datetime.html
>> > WITHOUT TIME ZONE: "timestamp without time zone value should be taken
>> > or given as timezone local time"
>> >
>>
>> RDBMSs conflict (universally) with ANSI.
>>
>> PostgreSQL TIMESTAMP WITH TIME ZONE is a 64-bit integer Instant since the
>> epoch. It has no timezone.
>>
>> MySQL/MariaDB/BigTable/[your fork here] TIMESTAMP is also an int Instant
>> since the epoch. It has no timezone.
>>
>> TIMESTAMP *WITHOUT* TIME ZONE is indeed akin to Numpy "naive datetime" in
>> *function*, but not in implementation:
>>
>>    - MySQL DATETIME is weird: 1-bit sign, 17-bit year+month, 5-bit day, ...
>>      <https://dev.mysql.com/doc/internals/en/date-and-time-data-type-representation.html>
>>    - MSSQL uses 6, 7 or 8 bytes
>>      <https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15>
>>    - PostgreSQL stores an integer, but I think its epoch is still different
>>      (it used to store doubles since 2000-01-01)
>>      <https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/include/datatype/timestamp.h>
>>
>> ... so in general, moving datetimes from these systems into 64-bit
>> integers is nontrivial and lossy.
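>>
>> Even the simplest case here, PostgreSQL, needs an epoch rebase. A rough
>> sketch, assuming its current integer representation (microseconds since
>> 2000-01-01):
>>
>>     # 2000-01-01 is 10957 days = 946684800 s after the Unix epoch
>>     POSTGRES_EPOCH_OFFSET_US = 946_684_800 * 1_000_000
>>
>>     def pg_to_unix_us(pg_us):
>>         # Rebase a PostgreSQL 64-bit timestamp onto the Unix epoch.
>>         return pg_us + POSTGRES_EPOCH_OFFSET_US
>>
>>     assert pg_to_unix_us(0) == 946_684_800_000_000  # 2000-01-01T00:00:00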
>>
>> > Spark / Databricks discusses how Spark handles this
>> > https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps
>> > * WITHOUT TIME ZONE: "These timestamps are not bound to any time zone,
>> > and are wall clock timestamps." — not UTC-normalized
>> > * WITH TIME ZONE: "does not affect the physical point in time that the
>> > timestamp represents, as that is fully represented by the UTC time
>> > instant given by the other timestamp components"
>> >
>>
>> I don't use Spark, but I read that page twice. First reading, I got the
>> same thing out of it. But the second time I read it, I read the opposite!
>>
>> The key part is: "*Spark SQL defines the timestamp type as TIMESTAMP WITH
>> SESSION TIME ZONE*," -- in other words, Spark doesn't have a TIMESTAMP WITH
>> TIME ZONE *or* a TIMESTAMP WITHOUT TIME ZONE. It has just one Timestamp
>> type: a 64-bit Instant since the epoch. (It also has a Date type.)
>>
>> If I'm reading correctly, this is exactly the same as PostgreSQL TIMESTAMP
>> WITH TIME ZONE and MySQL TIMESTAMP: servers always store and transmit
>> timestamps as bare 64-bit integers since the epoch -- without timezone.
>>
>>
>> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
>> > interpreted as UTC-normalized, that would force all of these other
>> > systems (and more) to serialize their data to be UTC-normalized
>>
>>
>> Are those systems' 64-bit integers interoperable in the first place?
>>
>> As I understand it, there's a ton of variance out there when encoding
>> datetimes as 64-bit integers. The Numpy+Arrow+Parquet+Postgres encoding is
>> one of many. As I mentioned in another thread, programming languages all
>> use structs.
>>
>> > pandas as discussed has non-UTC-normalized WITHOUT TIME ZONE "naive"
>> > timestamps and UTC-normalized WITH TIME ZONE.
>> >
>> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
>> > interpreted as UTC-normalized, that would force all of these other
>> > systems (and more) to serialize their data to be UTC-normalized (i.e.
>> > calling the equivalent of pandas's tz_localize function) when they
>> > convert to Arrow.
>>
>>
>> Alternatives:
>>
>>    - int64
>>    - date32+time64
>>    - date32+time32
>>
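>> For instance, the date32+time64 split could be written as a pyarrow schema
>> like this (just a sketch of one possibility, not a recommendation):
>>
>>     import pyarrow as pa
>>
>>     # One possible "split" representation of a naive local datetime
>>     split_schema = pa.schema([
>>         pa.field("local_date", pa.date32()),      # days since 1970-01-01
>>         pa.field("local_time", pa.time64("us")),  # microseconds since midnight
>>     ])
>>
>>     # ... or just a raw int64 column with application-defined meaning
>>     raw_schema = pa.schema([pa.field("local_datetime_us", pa.int64())])
>>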
>> > This seems very harmful to me, and will make data
>> > from these systems not accurately representable in Arrow and unable to
>> > be round-tripped.
>> >
>>
>> Certainly nobody wants to go backwards.
>>
>> We need to clarify: how do we store these *common* types -- MySQL
>> TIMESTAMP, Spark TIMESTAMP, PostgreSQL TIMESTAMP WITH TIME ZONE -- in
>> Arrow?
>>
>> Secondarily, I think: how do we recommend users store *datetimes* in
>> Arrow?
>> (I'd expect this to be messier, since every system/language uses a
>> different byte structure.)
>>
>> > Perhaps we can make a spreadsheet and look comprehensively at how many
>> > use cases would be disenfranchised by requiring UTC normalization
>> > always.
>>
>>
>> Hear, hear!
>>
>> Can we also poll people to find out how they're storing Instants today?
>>
>> Enjoy life,
>> Adam
>>
>> --
>> Adam Hooper
>> +1-514-882-9694
>> http://adamhooper.com
>>
>
