Thanks, Weston. I guess I’m in the camp that wants Instant to be a first-class type, on a par with ZonedDateTime and LocalDateTime. I have added my comments to the “Arguments for” section in the 2nd doc.
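For concreteness, here is a minimal sketch of the three kinds in stdlib Python. The mapping to naive/aware datetime and an epoch offset is illustrative only, not a proposed Arrow API:

    from datetime import datetime, timezone, timedelta

    # Zoneless (cf. LocalDateTime): a wall-clock reading with no zone attached.
    zoneless = datetime(2021, 6, 22, 10, 38)

    # Zoned (cf. ZonedDateTime): the same wall-clock reading pinned to a zone.
    zoned = datetime(2021, 6, 22, 10, 38, tzinfo=timezone(timedelta(hours=-10)))

    # Instant (cf. Instant): a point on the global timeline, as an epoch offset.
    instant = zoned.timestamp()  # seconds since 1970-01-01 00:00:00 UTC

    # A zoneless value cannot become an instant without first choosing a
    # zone -- which is exactly where the three kinds differ.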
> On Jun 22, 2021, at 10:38 AM, Weston Pace <weston.p...@gmail.com> wrote:
>
> There are two (related) proposals being discussed, I think. I created streamlined documents for each. Feel free to add arguments for/against or to clarify. If no one has any better ideas I'll start a vote topic for each one in ~48 hours.
>
> # Proposal: Clarify meaning of timestamp without time zone
> * https://docs.google.com/document/d/1wDAuxEDVo3YxZx20fGUGqQxi3aoss7TJ-TzOUjaoZk8/edit?usp=sharing
>
> # Proposal: Arrow should define how an “Instant” is stored
> * https://docs.google.com/document/d/1xEKRhs-GUSMwjMhgmQdnCNMXwZrA10226AcXRoP8g9E/edit?usp=sharing
>
> On Tue, Jun 22, 2021 at 6:01 AM Julian Hyde <jhyde.apa...@gmail.com> wrote:
>>
>> My proposal is that Arrow should support three different kinds of date-times: zoneless, zoned, and instant. (Not necessarily with those names.)
>>
>> All three kinds occur frequently in the industry.
>>
>> Many systems only have two, and users of those systems have figured out how to make do. (For example, you can implement an instant using a zoneless, and vice versa, if you are careful.) But let’s suppose that Arrow has two of the three, and needs to interoperate with a system that has a different two of the three. Chaos ensues.
>>
>> Let’s just implement all three.
>>
>> Julian
>>
>>> On Jun 22, 2021, at 8:46 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>
>>> Let's see a streamlined document about what we are voting on — I have limited bandwidth to read through and synthesize the discussion myself and I am probably not the only one. It has always been my understanding to represent time elapsed from the UNIX epoch (1970-01-01 00:00:00 UTC) as a timestamp with tz=UTC (which I believe is what is being called an "instant").
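For what it's worth, that reading is straightforward to express in pyarrow today. A minimal sketch (the value is arbitrary, and the exact printed form varies by pyarrow version):

    import pyarrow as pa

    # Elapsed time from the UNIX epoch, annotated tz=UTC: the "instant" reading.
    arr = pa.array([136800000], type=pa.timestamp("ms", tz="UTC"))
    print(arr[0])  # 1970-01-02 14:00:00 UTC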
>>> On Tue, Jun 22, 2021 at 9:01 AM Adam Hooper <a...@adamhooper.com> wrote:
>>>>
>>>> Maybe Arrow should add a new type, "INSTANT"?
>>>>
>>>> An Instant (seconds since the epoch) is the gold standard for storing moments in time. All programming languages, all RDBMSs, Parquet ... everybody uses this.
>>>>
>>>> I use Instants, too. I interpreted TIMESTAMP with no metadata to mean Instant because I read the docs *assuming* Arrow stores Instants.
>>>>
>>>> I know, I know, no vote can stop me from interpreting "timestamp without timezone" however the heck I want. But it's painful for me to transition from happy user to heretic.
>>>>
>>>> Voting to clarify that Arrow doesn't store Instants is voting to clarify that Arrow *doesn't* do something extremely useful. It's voting for a negative. That sounds painful! What if there were positives to vote for? An "INSTANT" type? A new TIMESTAMP metadata field, "instant" (on by default)? A fiat that timezone=UTC means Instant, not ZonedDateTime?
>>>>
>>>> Enjoy life,
>>>> Adam
>>>>
>>>> On Mon, Jun 21, 2021 at 11:39 PM Weston Pace <weston.p...@gmail.com> wrote:
>>>>
>>>>> I agree that a vote would be a good idea. Do you want to start a dedicated vote thread? I can write one up too if you'd rather.
>>>>>
>>>>> -Weston
>>>>>
>>>>> On Mon, Jun 21, 2021 at 4:54 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>>>>
>>>>>> I think comments on the doc are tailing off. Jorge's test cases, I think, still need some more careful analysis, but Weston has provided an initial pass.
>>>>>>
>>>>>> The matter not resolved in the document is whether Timestamp with timezone logically represents a multi-field date and time (that does not represent a specific instant) or whether it logically represents an instant (some measurable offset from an epoch). Based on comments on the documentation, both the C++/Python and Java implementations (those that have generally been considered "reference") have evidence that the former representation is what was intended (some links are in the document).
>>>>>>
>>>>>> We can probably continue to debate what is useful, but it seems ultimately we need to pick one or the other and clarify the specification. Given how the reference implementations currently work, I think we should err on the side of interpreting these values as date-times. Ultimately, given the contention here, we will likely need to vote on this.
>>>>>>
>>>>>> More comments on the document or here are still useful in case we've missed an interpretation or there are other facts to consider.
>>>>>>
>>>>>> Cheers,
>>>>>> Micah
>>>>>>
>>>>>> On Thu, Jun 17, 2021 at 9:59 PM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
>>>>>>>
>>>>>>> Thank you everyone for participating so far; really important and useful discussion.
>>>>>>>
>>>>>>> I think of this discussion as a set of test cases over behavior:
>>>>>>>
>>>>>>> parameterization:
>>>>>>> * Timestamp(ms, None)
>>>>>>> * Timestamp(ms, "00:00")
>>>>>>> * Timestamp(ms, "01:00")
>>>>>>>
>>>>>>> Cases:
>>>>>>> * its string representation equals to
>>>>>>> * add a duration equals to
>>>>>>> * add an interval equals to
>>>>>>> * subtract a Timestamp(ms, None) equals to
>>>>>>> * subtract a Timestamp(ms, "01:00") equals to
>>>>>>> * subtract a Date32 equals to
>>>>>>> * subtract a Time32(ms) equals to
>>>>>>> * extract the day equals to
>>>>>>> * extract the timezone equals to
>>>>>>> * cast to Timestamp(ms, None) equals to
>>>>>>> * cast to Timestamp(ms, "01:00") equals to
>>>>>>> * write to parquet v2 equals to (physical value and logical type)
>>>>>>>
>>>>>>> In all cases, the result may either be valid or invalid. If valid, we would need a datatype and an actual value. I was hoping to be able to answer each of the above at the end of this discussion.
>>>>>>>
>>>>>>> I've suggested adding these in the google docs.
>>>>>>>
>>>>>>> Best,
>>>>>>> Jorge
>>>>>>>
>>>>>>> On Fri, Jun 18, 2021 at 12:15 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> I've posted the examples above in https://docs.google.com/document/d/1QDwX4ypfNvESc2ywcT1ygaf2Y1R8SmkpifMV7gpJdBI/edit?usp=sharing because I think it would be better to collaborate there instead of a linear e-mail history and then bring the consensus back to the list.
>>>>>>>>
>>>>>>>> On Thu, Jun 17, 2021 at 2:56 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> I feel like we might still be talking past each other here, or at least I don't understand the two sides of this. I'll try to expand Weston's example because I think it provides the best clarification.
>>>>>>>>>
>>>>>>>>> (1970, 1, 2, 14, 0) is stored as 0x0000000008276700 (136800000, assuming ms) for a timestamp column without timezone (always). This represents an offset from the unix epoch. This interpretation should not change based on the local system timezone. Extracting the hour field always yields 14 (extraction is done relative to UTC).
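A stdlib-only check of that arithmetic, including the timezone-annotated extraction described just below (PST taken as a fixed GMT-8, the same simplification the example uses):

    from datetime import datetime, timezone, timedelta

    ms = 136800000  # 0x0000000008276700
    utc = datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
    assert (utc.year, utc.month, utc.day, utc.hour) == (1970, 1, 2, 14)

    # Same stored bits, extraction done relative to PST (= GMT-8) instead:
    pst = utc.astimezone(timezone(timedelta(hours=-8)))
    assert pst.hour == 6  # the instant is unchanged; only extraction shifts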
>>>>>>>>> The alternative here seems to be that we can encode (1970, 1, 2, 14, 0) in multiple different ways depending on what the current local system time is. As a note, I think ORC and Spark do this, and it leads to confusion/misinterpretation when trying to transfer data.
>>>>>>>>>
>>>>>>>>> If we then convert this column to a timestamp with timezone "UTC", extracting the hour field still yields 14. If the column is converted to a timestamp with timezone "PST", extracting the hour would yield 6 (assuming PST = GMT-8). Through all of these changes the data bits do not change.
>>>>>>>>>
>>>>>>>>> Display is not mentioned because I think the points made about how a time is displayed are correct. Applications can choose what they feel makes sense to them (as long as they don't start automatically tacking on timezones to naive timestamps). My interpretation of the specification has been that display was a kind of shorthand for field extraction.
>>>>>>>>>
>>>>>>>>> Could others on the thread confirm this is the issue up for debate? Are there subtleties/operations we need to consider?
>>>>>>>>>
>>>>>>>>> I also agree that we should document recommended conversion practices from other systems [1].
>>>>>>>>>
>>>>>>>>> -Micah
>>>>>>>>>
>>>>>>>>>> So let's invent a third way. I could use the first 16 bits for the year, the next 8 bits for the month, the next 8 bits for the day of month, the next 8 bits for the hour, the next 8 bits for the minute, and the remaining bits for the seconds. Using this method I would store (1970, 1, 2, 14, 0) as 0x07B201020E000000.
>>>>>>>>>
>>>>>>>>> Aside: with some small variation, this is what ZetaSql uses [2].
>>>>>>>>>
>>>>>>>>> [1] https://arrow.apache.org/docs/python/timestamps.html#pandas-arrow-spark
>>>>>>>>> [2] https://github.com/google/zetasql/blob/master/zetasql/public/civil_time.h#L62
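For concreteness, that packing in a few lines of Python (illustrative only; not a format any implementation ships):

    def pack_civil(year, month, day, hour, minute, second=0):
        # 16-bit year, then 8 bits each for month, day, hour, minute;
        # the remaining low 16 bits hold the seconds.
        return ((year << 48) | (month << 40) | (day << 32)
                | (hour << 24) | (minute << 16) | second)

    assert pack_civil(1970, 1, 2, 14, 0) == 0x07B201020E000000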
>>>>>>>>> On Thu, Jun 17, 2021 at 1:58 PM Adam Hooper <a...@adamhooper.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On Thu, Jun 17, 2021 at 2:59 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> The SQL standard (e.g. PostgreSQL) has two timestamp types: with/without time zone — in some SQL implementations each slot can have a different time zone. https://www.postgresql.org/docs/9.1/datatype-datetime.html
>>>>>>>>>>> WITHOUT TIME ZONE: "timestamp without time zone value should be taken or given as timezone local time"
>>>>>>>>>>
>>>>>>>>>> RDBMSs conflict (universally) with ANSI.
>>>>>>>>>>
>>>>>>>>>> PostgreSQL TIMESTAMP WITH TIME ZONE is a 64-bit int Instant since the epoch. It has no timezone.
>>>>>>>>>>
>>>>>>>>>> MySQL/MariaDB/BigTable/[your fork here] TIMESTAMP is also an int Instant since the epoch. It has no timezone.
>>>>>>>>>>
>>>>>>>>>> TIMESTAMP *WITHOUT* TIME ZONE is indeed akin to Numpy "naive datetime" in *function*, but not in implementation:
>>>>>>>>>>
>>>>>>>>>> - MySQL DATETIME <https://dev.mysql.com/doc/internals/en/date-and-time-data-type-representation.html> is weird: 1-bit sign, 17-bit month, 5-bit day, ....
>>>>>>>>>> - MSSQL <https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15> uses 6, 7 or 8 bytes
>>>>>>>>>> - PostgreSQL stores an integer, but I think its epoch is still different <https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/include/datatype/timestamp.h> (it used to store doubles since 2000-01-01)
>>>>>>>>>>
>>>>>>>>>> ... so in general, moving datetimes from these systems into 64-bit integers is nontrivial and lossy.
>>>>>>>>>>
>>>>>>>>>>> Spark / Databricks discusses how Spark handles this: https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps
>>>>>>>>>>> * WITHOUT TIME ZONE: "These timestamps are not bound to any time zone, and are wall clock timestamps." — not UTC-normalized
>>>>>>>>>>> * WITH TIME ZONE: "does not affect the physical point in time that the timestamp represents, as that is fully represented by the UTC time instant given by the other timestamp components"
>>>>>>>>>>
>>>>>>>>>> I don't use Spark, but I read that page twice. First reading, I got the same thing out of it. But the second time I read it, I read the opposite!
>>>>>>>>>>
>>>>>>>>>> The key part is: "*Spark SQL defines the timestamp type as TIMESTAMP WITH SESSION TIME ZONE*" -- in other words, Spark doesn't have a TIMESTAMP WITH TIME ZONE *or* a TIMESTAMP WITHOUT TIME ZONE. It has just one Timestamp type: a 64-bit Instant since the epoch. (It also has a Date type.)
>>>>>>>>>>
>>>>>>>>>> If I'm reading correctly, this is exactly the same as PostgreSQL TIMESTAMP WITH TIME ZONE and MySQL TIMESTAMP: servers always store and transmit timestamps as bare 64-bit integers since the epoch -- without timezone.
>>>>>>>>>>
>>>>>>>>>>> If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be interpreted as UTC-normalized, that would force all of these other systems (and more) to serialize their data to be UTC-normalized
>>>>>>>>>>
>>>>>>>>>> Are those systems' 64-bit integers interoperable in the first place?
>>>>>>>>>>
>>>>>>>>>> As I understand it, there's a ton of variance out there when encoding datetimes as 64-bit integers. The Numpy+Arrow+Parquet+Postgres encoding is one of many. As I mentioned in another thread, programming languages all use structs.
>>>>>>>>>>
>>>>>>>>>>> pandas as discussed has non-UTC-normalized WITHOUT TIME ZONE "naive" timestamps and UTC-normalized WITH TIME ZONE.
>>>>>>>>>>>
>>>>>>>>>>> If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be interpreted as UTC-normalized, that would force all of these other systems (and more) to serialize their data to be UTC-normalized (i.e. calling the equivalent of pandas's tz_localize function) when they convert to Arrow.
>>>>>>>>>>
>>>>>>>>>> Alternatives:
>>>>>>>>>>
>>>>>>>>>> - int64
>>>>>>>>>> - date32+time64
>>>>>>>>>> - date32+time32
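A sketch of the date32+time64 option, round-tripping a naive datetime without pretending it is an instant (the column names are made up):

    from datetime import date, datetime
    import pyarrow as pa

    dt = datetime(2021, 6, 17, 14, 30)  # a wall-clock reading, no zone
    days = (dt.date() - date(1970, 1, 1)).days
    micros = ((dt.hour * 60 + dt.minute) * 60 + dt.second) * 1_000_000 + dt.microsecond

    table = pa.table({
        "d": pa.array([days], type=pa.date32()),        # days since epoch
        "t": pa.array([micros], type=pa.time64("us")),  # microseconds since midnight
    })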
>>>>>>>>>>> This seems very harmful to me, and will make data from these systems not accurately representable in Arrow and unable to be round-tripped.
>>>>>>>>>>
>>>>>>>>>> Certainly nobody wants to go backwards.
>>>>>>>>>>
>>>>>>>>>> We need to clarify: how do we store these *common* types -- MySQL TIMESTAMP, Spark TIMESTAMP, PostgreSQL TIMESTAMP WITH TIME ZONE -- in Arrow?
>>>>>>>>>>
>>>>>>>>>> Secondarily, I think: how do we recommend users store *datetimes* in Arrow? (I'd expect this to be messier, since every system/language uses a different byte structure.)
>>>>>>>>>>
>>>>>>>>>>> Perhaps we can make a spreadsheet and look comprehensively at how many use cases would be disenfranchised by requiring UTC normalization always.
>>>>>>>>>>
>>>>>>>>>> Hear, hear!
>>>>>>>>>>
>>>>>>>>>> Can we also poll people to find out how they're storing Instants today?
>>>>>>>>>>
>>>>>>>>>> Enjoy life,
>>>>>>>>>> Adam
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Adam Hooper
>>>>>>>>>> +1-514-882-9694
>>>>>>>>>> http://adamhooper.com
>>>>
>>>> --
>>>> Adam Hooper
>>>> +1-514-882-9694
>>>> http://adamhooper.com