Everything is very clear now thanks to your explanation. :) Okay, then I'll fix the issue by making DateDatum timezone-neutral and TimeDatum UTC-based. Also, I'll play with PostgreSQL to understand its timezone model better.
Meanwhile, please feel free to assign timezone-related bugs to me as you see fit. On Mon, Nov 24, 2014 at 5:52 PM, Hyunsik Choi <[email protected]> wrote: > Thank you all guys for your comments. > > Jaewoong, > > I leave inline comments. If my answers are not enough for your > question or I misunderstood your question, please feel free to ask > additional questions. > > Best regards, > Hyunsik > > On Fri, Nov 21, 2014 at 12:50 AM, Jaewoong Jung <[email protected]> wrote: >> There's another issue that hopefully Hyunsik would be able to clarify, >> and it's very crucial to handling timezones in these data types. >> >> Q: So, let's say (and I agree) TimeDatum represents an instant, so can >> be timezoned. Then, is it a UTC time or a local time? >> > > TimeDatum is UTC or should be UTC if some parts are not. > >> Let me explain why this question is important. >> >> DateDatum represents an instant. And, it is implicitly timezoned to >> the user local time. Which means, if I use '11/20/2014', it's >> '11/20/2014 0:00:00 PST' and is equivalent to a TimestampDatum for >> '11/20/2014 8:00:00 UTC'. (BTW, the compareTo implementation for >> DateDatum is broken in this regard. I'll file a separate issue for >> that.) > > I agree with your proposal. I also think that DateDatum does not need > to be timezoned. We should keep it as DateDatum instead of > TimestampDatum. Thank you for nice finding! > >> What about TimeDatum? It is currently timezoned to UTC, (surprise, >> anyone?) if I understood the code correctly. When we add a TimeDatum >> to a DateDatum, we convert the TimeDatum to the user local time, which >> implies TimeDatum is UTC-based. (Also, the comment next to the line >> explicitly mentions it.) > > You are right. TimeDatum represents UTC time. > > FYI, I'd like to describe additional background. There are only two > entry points to take time or timestamp values. One is records in input > tables, and another is SQL statements. Currently, Input table uses the > system global timezone specified in TajoConf (tajo-site.xml file). > Later, we will add one table property to allow users to specify > timezone for each table. For SQL statement (e.g., SELECT time > '03:00:00'), we will use client timezone. Also, we will provide some > expression to allow users to specify timezone for time or timestamp in > SQL statements. > > Consequently, only two entry points have to deal with timezone for > Timestamp and Time. Other parts in Tajo should deal with all values in > UTC. > >> >> So, here's the problem. >> >> Why do they have different timezones? They're incomplete as an instant >> when used alone and are complementary to each other. This is an >> important concept. To understand it, you have to think about why >> adding DateDatum and TimeDatum is allowed in the code. Originally, >> instants can't be added. (And, that's why I thought TimeDatum is not >> an instant.) You can't add (say) 11/20/2014 to 11/23/2019. Subtracting >> an instant from an instant makes sense and results in a period, but >> they can't be added. >> >> However, in the case of DateDatum and TimeDatum, additions are allowed >> because they're complementary to each other, and what the code does >> **conceptually** is concatenate the two. > > Great insight! So far, I haven't thought it. > >> Therefore, because they're intended to be used together, I'd argue >> they shouldn't have different timezones. Also, if they have different >> timezones, additions can't have a simple correct answer. What's the >> correct answer of 11/20/2014 (PST) + 8:00:00 (UTC)? There's no clear >> answer because they can't be simply concatenated. >> >> (FWIW, the current Tajo code thinks the answer is 11/20/2014 8:00:00 >> in UTC. How many of you got it right?) > > You are definitely right :) If they have different timezone, the > problem becomes very complicated. Nobody wants it :) As I mentioned, > Timezone problem of Timestamp and Time data types should be addressed > in two entry points and client. We need to keep the processing > approach simple. > >> This can cause a lot of confusion to users. When they use a date >> alone, it is interpreted as a local time. But, as soon as they add a >> time to it, it is silently converted to UTC in a way which is very >> unexpected to many users. >> >> Why am I emphasizing it is unexpected? Look at the comparison below. >> >> What's the answer to 11/20/2014 (DateDatum) + 8 hours (IntervalDatum)? >> It's 11/20/2014 8:00 in a local time (PST on my machine). How about >> 11/20/2014 (DateDatum) + 8:00:00 (TimeDatum)? It's 11/20/2014 8:00 in >> UTC as I wrote above. How many users would be able to expect this? > > They are definitely a bug. We follow PostgreSQL in all aspects, and > the following results come from the PostgreSQL. The results of two > operations are the same. > > hyunsik=> SELECT date '11/20/2014' + time '08:00'; > ?column? > --------------------- > 2014-11-20 08:00:00 > (1 row) > > hyunsik=> select date '11/20/2014' + interval '8 hrs'; > ?column? > --------------------- > 2014-11-20 08:00:00 > (1 row) > > >> >> So, coming back to my original question. What timezone should a >> TimeDatum have? UTC or local time? It's currently UTC. But, I believe >> it should be changed to the local time zone. > > TimeDatum is UTC. > > In sum, we should keep both TimeDatum and TimestampDatum UTC values. > Then, we should address timezone offsets in two entry points and > client side. > >> >> (Sorry for the long email. But, I think it's critical to get this >> right and build consensus over it so that we can provide a consistent >> behavior going forward. The actual fix will be really simple, though.) >> >> On Thu, Nov 20, 2014 at 2:44 AM, Hyunsik Choi <[email protected]> wrote: >>> Those parts have poor documentation. >>> >>> I agree with your investigation. I also could find many misuse of timezone >>> in many parts. We should make them clear and fix them in this chance. >>> >>> I just got off the plane, and I'm still on the road. So, I'll give more >>> comments tomorrow. >>> >>> - hyunsik >>> >>> On Thu, Nov 20, 2014 at 5:24 PM, Jaewoong Jung <[email protected]> wrote: >>> >>>> Wow, this seemingly trivial issue has surprisingly many problems involved. >>>> >>>> The most critical one, though, is TimeMeta class. Presumably because >>>> it is poorly documented, it is being used for two different purposes >>>> in Tajo code base. Some code treats it as a date time representation, >>>> which I believe is the original intention, but some treat it as the >>>> human-readable equivalent of TimeDatum by completely ignoring >>>> date-related fields. >>>> >>>> For example, DateTimeUtil.date2j(long julianDate, TimeMeta tm), which >>>> converts a julian timestamp to a TimeMeta value, doesn't touch >>>> dayOfMonth, monthOfYear, or years values and just puts all values for >>>> hour and above units in hours field. >>>> >>>> There are other minor problems like incorrect comments and absent >>>> default values, but the most critical one is misuse of TimeMeta. >>>> >>>> I'll try to break up my patches so that each has a clear and >>>> easy-to-understand goal. Sending this heads-up to let you know >>>> there'll be more issues filed and patches sent than you might expect. >>>> >>>> On Wed, Nov 19, 2014 at 10:56 PM, Jaewoong Jung <[email protected]> wrote: >>>> > Yeah, after some more research, I found that TImeDatum is a somewhat >>>> > ambiguous data type. >>>> > >>>> > Its original purpose is to represent a time of a day, i.e. hh:mm:ss >>>> > part of an instant. So, it could be viewed as instant data though some >>>> > may argue it's incomplete to fully represent an instant. Anyway, given >>>> > that TimestampDatum has limitation in terms of the time range it can >>>> > represent, and given that Tajo doesn't have DateTime data type, >>>> > clients should be allowed to use it with a timezone. >>>> > >>>> > I'll change the direction and try to address the issue by fixing the >>>> > underflow error. >>>> > >>>> > Thanks for the input. :) >>>> > >>>> > On Wed, Nov 19, 2014 at 10:24 PM, Jong-young Park <[email protected]> >>>> wrote: >>>> >> Hi, Jaewoong. >>>> >> >>>> >> To express time period value, IntervalDatum is existing as I know. >>>> >> >>>> >> So I think it is right that TimeDatum is for some time point. >>>> >> >>>> >> And TimestampDatum seems it is doing both roles of DateDatum and >>>> TimeDatum. >>>> >> >>>> >> Regards, >>>> >> Jongyoung. >>>> >> >>>> >> On Wed Nov 19 2014 at 오후 5:23:40 Jaewoong Jung <[email protected]> >>>> wrote: >>>> >> >>>> >>> It turns out, TAJO-1191 is slightly more complicated than I thought. >>>> >>> (https://issues.apache.org/jira/browse/TAJO-1191) >>>> >>> >>>> >>> Basically, it's about whether TimeDatum may have a timezone tied with >>>> >>> it. I **believe** TimeDatum is originally designed to hold a time >>>> >>> period value, not an instant. (TimestampDatum seems to be the >>>> >>> canonical container for instants.) So, it doesn't make sense to apply >>>> >>> any timezones to TimeDatum values, but it's being done in a few >>>> >>> places. And, that's why the test is failing on my machines. >>>> >>> >>>> >>> I'm going to try to fix it by removing all timezone-related references >>>> >>> around the class, but I want to check my assumption with you before I >>>> >>> proceed. >>>> >>> >>>> >>> What do you think about it? >>>> >>> >>>>
