Thanks, Weston. I guess I’m in the camp that wants Instant to be a first-class type, on a par with ZonedDateTime and LocalDateTime. I have added my comments to the “Arguments for” section in the 2nd doc.
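For concreteness, here is a minimal sketch of the three kinds in stdlib Python. The mapping to naive/aware datetime and an epoch offset is illustrative only, not a proposed Arrow API:

    from datetime import datetime, timezone, timedelta

    # Zoneless (cf. LocalDateTime): a wall-clock reading with no zone attached.
    zoneless = datetime(2021, 6, 22, 10, 38)

    # Zoned (cf. ZonedDateTime): the same wall-clock reading pinned to a zone.
    zoned = datetime(2021, 6, 22, 10, 38, tzinfo=timezone(timedelta(hours=-10)))

    # Instant (cf. Instant): a point on the global timeline, as an epoch offset.
    instant = zoned.timestamp()  # seconds since 1970-01-01 00:00:00 UTC

    # A zoneless value cannot become an instant without first choosing a
    # zone -- which is exactly where the three kinds differ.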
> On Jun 22, 2021, at 10:38 AM, Weston Pace <weston.p...@gmail.com> wrote:
>
> There are two (related) proposals being discussed, I think. I created streamlined documents for each. Feel free to add arguments for/against or to clarify. If no one has any better ideas I'll start a vote topic for each one in ~48 hours.
>
> # Proposal: Clarify meaning of timestamp without time zone
> * https://docs.google.com/document/d/1wDAuxEDVo3YxZx20fGUGqQxi3aoss7TJ-TzOUjaoZk8/edit?usp=sharing
>
> # Proposal: Arrow should define how an “Instant” is stored
> * https://docs.google.com/document/d/1xEKRhs-GUSMwjMhgmQdnCNMXwZrA10226AcXRoP8g9E/edit?usp=sharing
>
> On Tue, Jun 22, 2021 at 6:01 AM Julian Hyde <jhyde.apa...@gmail.com> wrote:
>>
>> My proposal is that Arrow should support three different kinds of date-times: zoneless, zoned, and instant. (Not necessarily with those names.)
>>
>> All three kinds occur frequently in the industry.
>>
>> Many systems only have two, and users of those systems have figured out how to make do. (For example, you can implement an instant using a zoneless, and vice versa, if you are careful.) But let’s suppose that Arrow has two of the three, and needs to interoperate with a system that has a different two of the three. Chaos ensues.
>>
>> Let’s just implement all three.
>>
>> Julian
>>
>>> On Jun 22, 2021, at 8:46 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>
>>> Let's see a streamlined document about what we are voting on — I have limited bandwidth to read through and synthesize the discussion myself and I am probably not the only one. It has always been my understanding to represent time elapsed from the UNIX epoch (1970-01-01 00:00:00 UTC) as a timestamp with tz=UTC (which I believe is what is being called an "instant").
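For what it's worth, that reading is straightforward to express in pyarrow today. A minimal sketch (the value is arbitrary, and the exact printed form varies by pyarrow version):

    import pyarrow as pa

    # Elapsed time from the UNIX epoch, annotated tz=UTC: the "instant" reading.
    arr = pa.array([136800000], type=pa.timestamp("ms", tz="UTC"))
    print(arr[0])  # 1970-01-02 14:00:00 UTC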
>>> On Tue, Jun 22, 2021 at 9:01 AM Adam Hooper <a...@adamhooper.com> wrote:
>>>>
>>>> Maybe Arrow should add a new type, "INSTANT"?
>>>>
>>>> An Instant (seconds since the epoch) is the gold standard for storing moments in time. All programming languages, all RDBMSs, Parquet ... everybody uses this.
>>>>
>>>> I use Instants, too. I interpreted TIMESTAMP with no metadata to mean Instant because I read the docs *assuming* Arrow stores Instants.
>>>>
>>>> I know, I know, no vote can stop me from interpreting "timestamp without timezone" however the heck I want. But it's painful for me to transition from happy user to heretic.
>>>>
>>>> Voting to clarify that Arrow doesn't store Instants is voting to clarify that Arrow *doesn't* do something extremely useful. It's voting for a negative. That sounds painful! What if there were positives to vote for? An "INSTANT" type? A new TIMESTAMP metadata field, "instant" (on by default)? A fiat that timezone=UTC means Instant, not ZonedDateTime?
>>>>
>>>> Enjoy life,
>>>> Adam
>>>>
>>>> On Mon, Jun 21, 2021 at 11:39 PM Weston Pace <weston.p...@gmail.com> wrote:
>>>>
>>>>> I agree that a vote would be a good idea. Do you want to start a dedicated vote thread? I can write one up too if you'd rather.
>>>>>
>>>>> -Weston
>>>>>
>>>>> On Mon, Jun 21, 2021 at 4:54 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>>>>
>>>>>> I think comments on the doc are tailing off. Jorge's test cases, I think, still need some more careful analysis, but Weston has provided an initial pass.
>>>>>>
>>>>>> The matter not resolved in the document is whether Timestamp with timezone logically represents a multi-field date and time (that does not represent a specific instant) or whether it logically represents an instant (some measurable offset from an epoch). Based on comments on the documentation, both the C++/Python and Java implementations (those that have generally been considered "reference") have evidence that the former representation is what was intended (some links are in the document).
>>>>>>
>>>>>> We can probably continue to debate what is useful, but it seems ultimately we need to pick one or the other and clarify the specification. Given how the reference implementations currently work, I think we should err on the side of interpreting these values as date-times. Ultimately, given the contention here, we will likely need to vote on this.
>>>>>>
>>>>>> More comments on the document or here are still useful in case we've missed an interpretation or there are other facts to consider.
>>>>>>
>>>>>> Cheers,
>>>>>> Micah
>>>>>>
>>>>>> On Thu, Jun 17, 2021 at 9:59 PM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
>>>>>>>
>>>>>>> Thank you everyone for participating so far; really important and useful discussion.
>>>>>>>
>>>>>>> I think of this discussion as a set of test cases over behavior:
>>>>>>>
>>>>>>> parameterization:
>>>>>>> * Timestamp(ms, None)
>>>>>>> * Timestamp(ms, "00:00")
>>>>>>> * Timestamp(ms, "01:00")
>>>>>>>
>>>>>>> Cases:
>>>>>>> * its string representation equals to
>>>>>>> * add a duration equals to
>>>>>>> * add an interval equals to
>>>>>>> * subtract a Timestamp(ms, None) equals to
>>>>>>> * subtract a Timestamp(ms, "01:00") equals to
>>>>>>> * subtract a Date32 equals to
>>>>>>> * subtract a Time32(ms) equals to
>>>>>>> * extract the day equals to
>>>>>>> * extract the timezone equals to
>>>>>>> * cast to Timestamp(ms, None) equals to
>>>>>>> * cast to Timestamp(ms, "01:00") equals to
>>>>>>> * write to parquet v2 equals to (physical value and logical type)
>>>>>>>
>>>>>>> In all cases, the result may either be valid or invalid. If valid, we would need a datatype and an actual value. I was hoping to be able to answer each of the above at the end of this discussion.
>>>>>>>
>>>>>>> I've suggested adding these in the google docs.
>>>>>>>
>>>>>>> Best,
>>>>>>> Jorge
>>>>>>>
>>>>>>> On Fri, Jun 18, 2021 at 12:15 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> I've posted the examples above in https://docs.google.com/document/d/1QDwX4ypfNvESc2ywcT1ygaf2Y1R8SmkpifMV7gpJdBI/edit?usp=sharing because I think it would be better to collaborate there instead of a linear e-mail history and then bring the consensus back to the list.
>>>>>>>>
>>>>>>>> On Thu, Jun 17, 2021 at 2:56 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> I feel like we might still be talking past each other here, or at least I don't understand the two sides of this. I'll try to expand Weston's example because I think it provides the best clarification.
>>>>>>>>>
>>>>>>>>> (1970, 1, 2, 14, 0) is stored as 0x0000000008276700 (136800000, assuming ms) for a timestamp column without timezone (always). This represents an offset from the unix epoch. This interpretation should not change based on the local system timezone. Extracting the hour field always yields 14 (extraction is done relative to UTC).
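A stdlib-only check of that arithmetic, including the timezone-annotated extraction described just below (PST taken as a fixed GMT-8, the same simplification the example uses):

    from datetime import datetime, timezone, timedelta

    ms = 136800000  # 0x0000000008276700
    utc = datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
    assert (utc.year, utc.month, utc.day, utc.hour) == (1970, 1, 2, 14)

    # Same stored bits, extraction done relative to PST (= GMT-8) instead:
    pst = utc.astimezone(timezone(timedelta(hours=-8)))
    assert pst.hour == 6  # the instant is unchanged; only extraction shifts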
>>>>>>>>> The alternative here seems to be that we can encode (1970, 1, 2, 14, 0) in multiple different ways depending on what the current local system time is. As a note, I think ORC and Spark do this, and it leads to confusion/misinterpretation when trying to transfer data.
>>>>>>>>>
>>>>>>>>> If we then convert this column to a timestamp with timezone "UTC", extracting the hour field still yields 14. If the column is converted to a timestamp with timezone "PST", extracting the hour would yield 6 (assuming PST = GMT-8). Through all of these changes the data bits do not change.
>>>>>>>>>
>>>>>>>>> Display is not mentioned because I think the points made about how a time is displayed are correct. Applications can choose what they feel makes sense to them (as long as they don't start automatically tacking on timezones to naive timestamps). My interpretation of the specification has been that display was a kind of shorthand for field extraction.
>>>>>>>>>
>>>>>>>>> Could others on the thread confirm this is the issue up for debate? Are there subtleties/operations we need to consider?
>>>>>>>>>
>>>>>>>>> I also agree that we should document recommended conversion practices from other systems [1].
>>>>>>>>>
>>>>>>>>> -Micah
>>>>>>>>>
>>>>>>>>>> So let's invent a third way. I could use the first 16 bits for the year, the next 8 bits for the month, the next 8 bits for the day of month, the next 8 bits for the hour, the next 8 bits for the minute, and the remaining bits for the seconds. Using this method I would store (1970, 1, 2, 14, 0) as 0x07B201020E000000.
>>>>>>>>>
>>>>>>>>> Aside: with some small variation, this is what ZetaSql uses [2].
>>>>>>>>>
>>>>>>>>> [1] https://arrow.apache.org/docs/python/timestamps.html#pandas-arrow-spark
>>>>>>>>> [2] https://github.com/google/zetasql/blob/master/zetasql/public/civil_time.h#L62
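For concreteness, that packing in a few lines of Python (illustrative only; not a format any implementation ships):

    def pack_civil(year, month, day, hour, minute, second=0):
        # 16-bit year, then 8 bits each for month, day, hour, minute;
        # the remaining low 16 bits hold the seconds.
        return ((year << 48) | (month << 40) | (day << 32)
                | (hour << 24) | (minute << 16) | second)

    assert pack_civil(1970, 1, 2, 14, 0) == 0x07B201020E000000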
>>>>>>>>> On Thu, Jun 17, 2021 at 1:58 PM Adam Hooper <a...@adamhooper.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On Thu, Jun 17, 2021 at 2:59 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> The SQL standard (e.g. PostgreSQL) has two timestamp types: with/without time zone — in some SQL implementations each slot can have a different time zone. https://www.postgresql.org/docs/9.1/datatype-datetime.html
>>>>>>>>>>> WITHOUT TIME ZONE: "timestamp without time zone value should be taken or given as timezone local time"
>>>>>>>>>>
>>>>>>>>>> RDBMSs conflict (universally) with ANSI.
>>>>>>>>>>
>>>>>>>>>> PostgreSQL TIMESTAMP WITH TIME ZONE is a 64-bit int Instant since the epoch. It has no timezone.
>>>>>>>>>>
>>>>>>>>>> MySQL/MariaDB/BigTable/[your fork here] TIMESTAMP is also an int Instant since the epoch. It has no timezone.
>>>>>>>>>>
>>>>>>>>>> TIMESTAMP *WITHOUT* TIME ZONE is indeed akin to Numpy "naive datetime" in *function*, but not in implementation:
>>>>>>>>>>
>>>>>>>>>> - MySQL DATETIME <https://dev.mysql.com/doc/internals/en/date-and-time-data-type-representation.html> is weird: 1-bit sign, 17-bit month, 5-bit day, ....
>>>>>>>>>> - MSSQL <https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15> uses 6, 7 or 8 bytes
>>>>>>>>>> - PostgreSQL stores an integer, but I think its epoch is still different <https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/include/datatype/timestamp.h> (it used to store doubles since 2000-01-01)
>>>>>>>>>>
>>>>>>>>>> ... so in general, moving datetimes from these systems into 64-bit integers is nontrivial and lossy.
>>>>>>>>>>
>>>>>>>>>>> Spark / Databricks discusses how Spark handles this: https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps
>>>>>>>>>>> * WITHOUT TIME ZONE: "These timestamps are not bound to any time zone, and are wall clock timestamps." — not UTC-normalized
>>>>>>>>>>> * WITH TIME ZONE: "does not affect the physical point in time that the timestamp represents, as that is fully represented by the UTC time instant given by the other timestamp components"
>>>>>>>>>>
>>>>>>>>>> I don't use Spark, but I read that page twice. First reading, I got the same thing out of it. But the second time I read it, I read the opposite!
>>>>>>>>>>
>>>>>>>>>> The key part is: "*Spark SQL defines the timestamp type as TIMESTAMP WITH SESSION TIME ZONE*" -- in other words, Spark doesn't have a TIMESTAMP WITH TIME ZONE *or* a TIMESTAMP WITHOUT TIME ZONE. It has just one Timestamp type: a 64-bit Instant since the epoch. (It also has a Date type.)
>>>>>>>>>>
>>>>>>>>>> If I'm reading correctly, this is exactly the same as PostgreSQL TIMESTAMP WITH TIME ZONE and MySQL TIMESTAMP: servers always store and transmit timestamps as bare 64-bit integers since the epoch -- without timezone.
>>>>>>>>>>
>>>>>>>>>>> If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be interpreted as UTC-normalized, that would force all of these other systems (and more) to serialize their data to be UTC-normalized
>>>>>>>>>>
>>>>>>>>>> Are those systems' 64-bit integers interoperable in the first place?
>>>>>>>>>>
>>>>>>>>>> As I understand it, there's a ton of variance out there when encoding datetimes as 64-bit integers. The Numpy+Arrow+Parquet+Postgres encoding is one of many. As I mentioned in another thread, programming languages all use structs.
>>>>>>>>>>
>>>>>>>>>>> pandas as discussed has non-UTC-normalized WITHOUT TIME ZONE "naive" timestamps and UTC-normalized WITH TIME ZONE.
>>>>>>>>>>>
>>>>>>>>>>> If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be interpreted as UTC-normalized, that would force all of these other systems (and more) to serialize their data to be UTC-normalized (i.e. calling the equivalent of pandas's tz_localize function) when they convert to Arrow.
>>>>>>>>>>
>>>>>>>>>> Alternatives:
>>>>>>>>>>
>>>>>>>>>> - int64
>>>>>>>>>> - date32+time64
>>>>>>>>>> - date32+time32
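A sketch of the date32+time64 option, round-tripping a naive datetime without pretending it is an instant (the column names are made up):

    from datetime import date, datetime
    import pyarrow as pa

    dt = datetime(2021, 6, 17, 14, 30)  # a wall-clock reading, no zone
    days = (dt.date() - date(1970, 1, 1)).days
    micros = ((dt.hour * 60 + dt.minute) * 60 + dt.second) * 1_000_000 + dt.microsecond

    table = pa.table({
        "d": pa.array([days], type=pa.date32()),        # days since epoch
        "t": pa.array([micros], type=pa.time64("us")),  # microseconds since midnight
    })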
>>>>>>>>>>> This seems very harmful to me, and will make data from these systems not accurately representable in Arrow and unable to be round-tripped.
>>>>>>>>>>
>>>>>>>>>> Certainly nobody wants to go backwards.
>>>>>>>>>>
>>>>>>>>>> We need to clarify: how do we store these *common* types -- MySQL TIMESTAMP, Spark TIMESTAMP, PostgreSQL TIMESTAMP WITH TIME ZONE -- in Arrow?
>>>>>>>>>>
>>>>>>>>>> Secondarily, I think: how do we recommend users store *datetimes* in Arrow? (I'd expect this to be messier, since every system/language uses a different byte structure.)
>>>>>>>>>>
>>>>>>>>>>> Perhaps we can make a spreadsheet and look comprehensively at how many use cases would be disenfranchised by requiring UTC normalization always.
>>>>>>>>>>
>>>>>>>>>> Hear, hear!
>>>>>>>>>>
>>>>>>>>>> Can we also poll people to find out how they're storing Instants today?
>>>>>>>>>>
>>>>>>>>>> Enjoy life,
>>>>>>>>>> Adam
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Adam Hooper
>>>>>>>>>> +1-514-882-9694
>>>>>>>>>> http://adamhooper.com
>>>>
>>>> --
>>>> Adam Hooper
>>>> +1-514-882-9694
>>>> http://adamhooper.com