@hyunsik, It's good new. :) 2014-11-23 23:30 GMT+09:00 Hyunsik Choi <[email protected]>:
> (Actually, I had suffered from some mailing list rejection problem for > few days. This problem was caused by the change of committer shell > access policy. Finally, I fixed the problem. I'm not sure if you > received the following mail. So, I send it again.) > > Hi Jaewoong, > > Today, I investigated the problem in more deep. I found the problem > about misuse of timezone. Recently, I've also suffered from this > problem while I'm working tajo-client. Actually, the recap of datetime > types were contributed by one contributor, and I reviewed the codes. > Since I overlooked timezone-related parts, it's my fault consequently. > Anyway, we can fix it. I'm very happy for us to fix this problem with > your help. > > The main cause of this problem is that Datum-level data structures > deal with timezone directly. > > First of all, in this email, I'm going to give the background in more > detail; I think that this background includes many answers of your > questions. Then, I'll give the answers of your questions in next > email. > > As you may know, In SQL, there are two kinds of data types: one is > without tiemzone and another is without timezone. > > SQL standard types without timezone: > * Timestamp > * Time > > SQL standard types with timezone: > * Timestampz > * Timez > > Each timestampz or Timez value has timezone offset itself. Both types > are straightforward. Also, Date type does not have 'with timezone' > type. > > BTW, 'types without timezone' (e.g., Timestamp and Time) can be > affected by an external timezone in almost all SQL systems that I've > seen. In other words, their timezone can be present in outside of data > values. By default, many systems use the timezone of local machine. > Similarly, Tajo uses a static variable 'TajoConf.CURRENT_TIMEZONE'. > > The main issue is when data values are adjusted by timezone. The > adjustment should be performed before Datum is created. In other > words, in scanner, Timestamp or Time value should be adjusted by > timezone. TimestampDatum or TimeDatum must be with UTC+0. But, in the > current implementation, some parts of TimestampDatum and TimeDatum > address directly timezone. It's wrong in my opinion. Also, all unit > tests for TimestampDatum and TimeDatum should be performed with UTC. > > In addition, timestamp or time values should be adjusted one more by > 'user timezone' when an user gets the query results. In sum, the flow > is as follows: > > Timestamp or Time value in an Input table ---(adjust by an external > timezone or local timezone)---> TimestampDatum or TimeDatum with UTC > ---> query processing ---(adjust by an user timezone)-----> Print > > The reason why TimestampDatum and TimeDaum values must be addressed in > UTC is the simplicity of processing codes. In many cases, two or more > tables used in one query can have multiple different timezones. If all > values are in UTC, the processing does not need to consider different > timezones. But, the current implementation has one global timezone. We > should fix it, and I created the jira issue for it at > https://issues.apache.org/jira/browse/TAJO-1186. > > Best regards, > Hyunsik > > On Fri, Nov 21, 2014 at 12:50 AM, Jaewoong Jung <[email protected]> wrote: > > There's another issue that hopefully Hyunsik would be able to clarify, > > and it's very crucial to handling timezones in these data types. > > > > Q: So, let's say (and I agree) TimeDatum represents an instant, so can > > be timezoned. Then, is it a UTC time or a local time? > > > > Let me explain why this question is important. > > > > DateDatum represents an instant. And, it is implicitly timezoned to > > the user local time. Which means, if I use '11/20/2014', it's > > '11/20/2014 0:00:00 PST' and is equivalent to a TimestampDatum for > > '11/20/2014 8:00:00 UTC'. (BTW, the compareTo implementation for > > DateDatum is broken in this regard. I'll file a separate issue for > > that.) > > > > What about TimeDatum? It is currently timezoned to UTC, (surprise, > > anyone?) if I understood the code correctly. When we add a TimeDatum > > to a DateDatum, we convert the TimeDatum to the user local time, which > > implies TimeDatum is UTC-based. (Also, the comment next to the line > > explicitly mentions it.) > > > > So, here's the problem. > > > > Why do they have different timezones? They're incomplete as an instant > > when used alone and are complementary to each other. This is an > > important concept. To understand it, you have to think about why > > adding DateDatum and TimeDatum is allowed in the code. Originally, > > instants can't be added. (And, that's why I thought TimeDatum is not > > an instant.) You can't add (say) 11/20/2014 to 11/23/2019. Subtracting > > an instant from an instant makes sense and results in a period, but > > they can't be added. > > > > However, in the case of DateDatum and TimeDatum, additions are allowed > > because they're complementary to each other, and what the code does > > **conceptually** is concatenate the two. > > > > Therefore, because they're intended to be used together, I'd argue > > they shouldn't have different timezones. Also, if they have different > > timezones, additions can't have a simple correct answer. What's the > > correct answer of 11/20/2014 (PST) + 8:00:00 (UTC)? There's no clear > > answer because they can't be simply concatenated. > > > > (FWIW, the current Tajo code thinks the answer is 11/20/2014 8:00:00 > > in UTC. How many of you got it right?) > > > > This can cause a lot of confusion to users. When they use a date > > alone, it is interpreted as a local time. But, as soon as they add a > > time to it, it is silently converted to UTC in a way which is very > > unexpected to many users. > > > > Why am I emphasizing it is unexpected? Look at the comparison below. > > > > What's the answer to 11/20/2014 (DateDatum) + 8 hours (IntervalDatum)? > > It's 11/20/2014 8:00 in a local time (PST on my machine). How about > > 11/20/2014 (DateDatum) + 8:00:00 (TimeDatum)? It's 11/20/2014 8:00 in > > UTC as I wrote above. How many users would be able to expect this? > > > > So, coming back to my original question. What timezone should a > > TimeDatum have? UTC or local time? It's currently UTC. But, I believe > > it should be changed to the local time zone. > > > > (Sorry for the long email. But, I think it's critical to get this > > right and build consensus over it so that we can provide a consistent > > behavior going forward. The actual fix will be really simple, though.) > > > > On Thu, Nov 20, 2014 at 2:44 AM, Hyunsik Choi <[email protected]> > wrote: > >> Those parts have poor documentation. > >> > >> I agree with your investigation. I also could find many misuse of > timezone > >> in many parts. We should make them clear and fix them in this chance. > >> > >> I just got off the plane, and I'm still on the road. So, I'll give more > >> comments tomorrow. > >> > >> - hyunsik > >> > >> On Thu, Nov 20, 2014 at 5:24 PM, Jaewoong Jung <[email protected]> > wrote: > >> > >>> Wow, this seemingly trivial issue has surprisingly many problems > involved. > >>> > >>> The most critical one, though, is TimeMeta class. Presumably because > >>> it is poorly documented, it is being used for two different purposes > >>> in Tajo code base. Some code treats it as a date time representation, > >>> which I believe is the original intention, but some treat it as the > >>> human-readable equivalent of TimeDatum by completely ignoring > >>> date-related fields. > >>> > >>> For example, DateTimeUtil.date2j(long julianDate, TimeMeta tm), which > >>> converts a julian timestamp to a TimeMeta value, doesn't touch > >>> dayOfMonth, monthOfYear, or years values and just puts all values for > >>> hour and above units in hours field. > >>> > >>> There are other minor problems like incorrect comments and absent > >>> default values, but the most critical one is misuse of TimeMeta. > >>> > >>> I'll try to break up my patches so that each has a clear and > >>> easy-to-understand goal. Sending this heads-up to let you know > >>> there'll be more issues filed and patches sent than you might expect. > >>> > >>> On Wed, Nov 19, 2014 at 10:56 PM, Jaewoong Jung <[email protected]> > wrote: > >>> > Yeah, after some more research, I found that TImeDatum is a somewhat > >>> > ambiguous data type. > >>> > > >>> > Its original purpose is to represent a time of a day, i.e. hh:mm:ss > >>> > part of an instant. So, it could be viewed as instant data though > some > >>> > may argue it's incomplete to fully represent an instant. Anyway, > given > >>> > that TimestampDatum has limitation in terms of the time range it can > >>> > represent, and given that Tajo doesn't have DateTime data type, > >>> > clients should be allowed to use it with a timezone. > >>> > > >>> > I'll change the direction and try to address the issue by fixing the > >>> > underflow error. > >>> > > >>> > Thanks for the input. :) > >>> > > >>> > On Wed, Nov 19, 2014 at 10:24 PM, Jong-young Park < > [email protected]> > >>> wrote: > >>> >> Hi, Jaewoong. > >>> >> > >>> >> To express time period value, IntervalDatum is existing as I know. > >>> >> > >>> >> So I think it is right that TimeDatum is for some time point. > >>> >> > >>> >> And TimestampDatum seems it is doing both roles of DateDatum and > >>> TimeDatum. > >>> >> > >>> >> Regards, > >>> >> Jongyoung. > >>> >> > >>> >> On Wed Nov 19 2014 at 오후 5:23:40 Jaewoong Jung <[email protected]> > >>> wrote: > >>> >> > >>> >>> It turns out, TAJO-1191 is slightly more complicated than I > thought. > >>> >>> (https://issues.apache.org/jira/browse/TAJO-1191) > >>> >>> > >>> >>> Basically, it's about whether TimeDatum may have a timezone tied > with > >>> >>> it. I **believe** TimeDatum is originally designed to hold a time > >>> >>> period value, not an instant. (TimestampDatum seems to be the > >>> >>> canonical container for instants.) So, it doesn't make sense to > apply > >>> >>> any timezones to TimeDatum values, but it's being done in a few > >>> >>> places. And, that's why the test is failing on my machines. > >>> >>> > >>> >>> I'm going to try to fix it by removing all timezone-related > references > >>> >>> around the class, but I want to check my assumption with you > before I > >>> >>> proceed. > >>> >>> > >>> >>> What do you think about it? > >>> >>> > >>> >
