[ 
https://issues.apache.org/jira/browse/DRILL-8101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Turton updated DRILL-8101:
--------------------------------
    Fix Version/s:     (was: 1.21.0)

> Resolve the TIMESTAMP madness
> -----------------------------
>
>                 Key: DRILL-8101
>                 URL: https://issues.apache.org/jira/browse/DRILL-8101
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.19.0
>            Reporter: Paul Rogers
>            Priority: Major
>
> Drill's TIMESAMP type tries to be two different things at the same time, 
> causing incorrect results when the two interpretations collide.
> Drill has the classic DATE and TIME data types. A DATE is just that: a day 
> wherever you happen to be. Your birthday goes from midnight to midnight in 
> the time zone where you find yourself. If you happen to travel around the 
> world, you can make your birthday last almost 48 hours as midnight of your 
> birthday starts at the international date line, circles the globe, followed 
> by the midnight of the next day.
> Similarly, a time is a time where you are. 12:00PM is noon (more-or-less) as 
> determined by the sun. 12:00PM occurs once in every time zone every day. 
> Since there are many time zones, there are many noons each day.
> These are both examples of local time. Most databases combine these two ideas 
> to get a DATETIME: a date and time wherever you are.
> In our modern world, knowing something occurred on 2022-01-02 12:00:00 is not 
> good enough. Did it occur at that time in my time zone or yours? If the event 
> is a user login, or a network breach, then it occurred once, at a specific 
> time, it did not occur many times: once in each time zone. Hence, machines 
> often use UTC time to coordinate.
> Unix-like systems also define the idea of a "timestamp", the number of 
> seconds (or milliseconds or nanoseconds) since 1970-01-01 00:00:00. This is 
> the time reported by Java in the {{System.currentTime()}} function. It is the 
> time most often found in machine-generated logs. It may be as a number (ms 
> since the epoch) or as an ISO-formatted string.
> Thus, users of Drill would expect to find a "timestamp" type that represents 
> a UTC timestamp in Unix format. The will be disappointed, however.
> Drill's TIMESTAMP type is essentially a DATETIME type: it is a date/time in 
> an unspecified timezone and that zone can be whatever you want it to be. UTC? 
> Fine. Local? OK. Nairobi? Sure, why not.
> This works fine as long as _all_ your data is in the same time zone, and you 
> don't need a concept of "now". As described in DRILL-8099 and DRILL-8100, 
> this is how the authors of CTAS thought of it: read Parquet data straight 
> into Drill with no conversion, then write it back out to JSON with no 
> conversion. Both work with UTC, so the result is fine: who cares that the 
> 32-bit number, when in Drill, had no implied time zone? It is just a number 
> we read then write. All good.
> It is even possible to compute the difference of two DATETIMEs with 
> unspecified time zone: that's what an INTERVAL does. As long as the times are 
> actually in the same zone (UTC, say, or local, or Nairobi), then all is fine.
> Everything collapses, however, when someone wants to know, "but how long ago 
> was that event"? "Long enough ago that I need to raise the escalation level?" 
> Drill has the INTERVAL type to give us the difference, but how do I get 
> "now"? Drill has {{CURRENT_TIMESTAMP}}. But, how we have a problem, what 
> timezone is that time in? UTC? My local timezone? Nairobi? And, what if my 
> data is UTC but {{CURRENT_TIMESTAMP}} is local? Or visa-versa? The whole 
> house of cards comes crashing down.
> Over the years, this bug has appeared again and again. Sometimes people 
> change the logic to assume TIMESTAMP is UTC. Sometimes things are changed to 
> assume TIMESTAMP is local time (I've been guilty of this). Sometimes we just 
> punt, and require that the machine (or test) run only in UTC, since that's 
> the only place the two systems coincide.
> But, in fact, I believe that the original designers of Drill meant TIMESTAMP 
> to have _no_ timezone: two TIMESTAMP values could be in entirely different 
> (unknown) timezones! One can see vestiges of this in the value vector code. 
> It seems the original engineers imagined a "TIMESTAMP_WITH_ZONE" type, 
> similar to Java's (or Joda's) {{ZonedDateTime}} type. Other bits of code 
> (Parquet) refers to a never-built "TIMESTAMPZ" type for a UTC timestamp. When 
> faced with the {{CURRENT_TIMESTAMP}} issue, fixes started down the path of 
> saying that TIMESTAMP is local time, but this is probably a misunderstanding 
> of the original design, forced upon us by the gaps in that original design.
> Further, each time we make a change (such as DRILL-8099 and DRILL-8100), we 
> change behavior, potentially breaking a kludge that someone found to 
> kinda-sorta make things work.
> Since computers can't deal with ambiguity the way humans can, we need a 
> solution. It is not good enough for you to think "TIMESTAMP is UTC" and me to 
> think "TIMESTAMP is local" and for Bob to think "TIMESTAMP is Java's 
> {{LocalDateTime}}, it has no zone." The software needs to work one way. 
> (Software is a pain that way.)
> One way to solve the issue is to introduce two new types: DATETIME and 
> UTC_TIMESTAMP. DATETIME has no timezone. It is not local time: it is an 
> unknown, unspecified timezone, as in all other SQL tools. It is noon on your 
> birthday, wherever you happen to be. UTC_TIMESTAMP is a Unix-like timestamp.
> We then need functions to convert. Here, we can just follow Java's 
> {{java.time}} package: to compare a UTC_TIMESTAMP with a DATETIME, you have 
> to specify a timezone.
> This will leave the current TIMESTAMP to settle down into a LOCAL_TIMESTAMP, 
> which will turn out to be useless and will (we can hope) fade away.
> All readers and writers that work with data that is known to be UTC (such as 
> Parquet) will use UTC_TIMESTAMP. Where the zone is not known (JDBC DATETIME), 
> then the data type will be DATETIME.
> The functions that deal with the current time need new versions: "NOW" for 
> "CURRENT_UTC_TIMESTAMP". The "AGE" function would use a UTC baseline for 
> UTC_TIMESTAMP, perhaps local time for a DATETIME argument.
> A design should spell out the details. Since time handing is HARD, the design 
> should lean heavily on the {{java.time}} package and just follow the enormous 
> work that went into that package.
> After years of fighting this beast, I've come to the conclusion that only the 
> two new types (and the deprecation and gradual fading away of the "local" 
> TIMESTAMP type) can finally resolve our never-ending stream of bugs in this 
> area.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to