Hi Jaewoong and CharSyam, I submitted the patch that fixes timezone-related problems discussed here. I rearranged all time zone usages throughout Tajo. https://issues.apache.org/jira/browse/TAJO-1234
Thanks to the discussion of this mailing list with you guys, I was clear enough to find or fix the problems. Warm regards, Hyunsik On Tue, Nov 25, 2014 at 5:54 PM, Hyunsik Choi <[email protected]> wrote: > I'm happy to work with you and fix a major nuisance :) Later, I'll > share the timezone related problem with you when I found additional > bugs. > > Warm regards, > Hyunsik > > On Tue, Nov 25, 2014 at 4:17 PM, Jaewoong Jung <[email protected]> wrote: >> Everything is very clear now thanks to your explanation. :) >> >> Okay, then I'll fix the issue by making DateDatum timezone-neutral and >> TimeDatum UTC-based. Also, I'll play with PostgreSQL to understand its >> timezone model better. >> >> Meanwhile, please feel free to assign timezone-related bugs to me as >> you see fit. >> >> On Mon, Nov 24, 2014 at 5:52 PM, Hyunsik Choi <[email protected]> wrote: >>> Thank you all guys for your comments. >>> >>> Jaewoong, >>> >>> I leave inline comments. If my answers are not enough for your >>> question or I misunderstood your question, please feel free to ask >>> additional questions. >>> >>> Best regards, >>> Hyunsik >>> >>> On Fri, Nov 21, 2014 at 12:50 AM, Jaewoong Jung <[email protected]> wrote: >>>> There's another issue that hopefully Hyunsik would be able to clarify, >>>> and it's very crucial to handling timezones in these data types. >>>> >>>> Q: So, let's say (and I agree) TimeDatum represents an instant, so can >>>> be timezoned. Then, is it a UTC time or a local time? >>>> >>> >>> TimeDatum is UTC or should be UTC if some parts are not. >>> >>>> Let me explain why this question is important. >>>> >>>> DateDatum represents an instant. And, it is implicitly timezoned to >>>> the user local time. Which means, if I use '11/20/2014', it's >>>> '11/20/2014 0:00:00 PST' and is equivalent to a TimestampDatum for >>>> '11/20/2014 8:00:00 UTC'. (BTW, the compareTo implementation for >>>> DateDatum is broken in this regard. I'll file a separate issue for >>>> that.) >>> >>> I agree with your proposal. I also think that DateDatum does not need >>> to be timezoned. We should keep it as DateDatum instead of >>> TimestampDatum. Thank you for nice finding! >>> >>>> What about TimeDatum? It is currently timezoned to UTC, (surprise, >>>> anyone?) if I understood the code correctly. When we add a TimeDatum >>>> to a DateDatum, we convert the TimeDatum to the user local time, which >>>> implies TimeDatum is UTC-based. (Also, the comment next to the line >>>> explicitly mentions it.) >>> >>> You are right. TimeDatum represents UTC time. >>> >>> FYI, I'd like to describe additional background. There are only two >>> entry points to take time or timestamp values. One is records in input >>> tables, and another is SQL statements. Currently, Input table uses the >>> system global timezone specified in TajoConf (tajo-site.xml file). >>> Later, we will add one table property to allow users to specify >>> timezone for each table. For SQL statement (e.g., SELECT time >>> '03:00:00'), we will use client timezone. Also, we will provide some >>> expression to allow users to specify timezone for time or timestamp in >>> SQL statements. >>> >>> Consequently, only two entry points have to deal with timezone for >>> Timestamp and Time. Other parts in Tajo should deal with all values in >>> UTC. >>> >>>> >>>> So, here's the problem. >>>> >>>> Why do they have different timezones? They're incomplete as an instant >>>> when used alone and are complementary to each other. This is an >>>> important concept. To understand it, you have to think about why >>>> adding DateDatum and TimeDatum is allowed in the code. Originally, >>>> instants can't be added. (And, that's why I thought TimeDatum is not >>>> an instant.) You can't add (say) 11/20/2014 to 11/23/2019. Subtracting >>>> an instant from an instant makes sense and results in a period, but >>>> they can't be added. >>>> >>>> However, in the case of DateDatum and TimeDatum, additions are allowed >>>> because they're complementary to each other, and what the code does >>>> **conceptually** is concatenate the two. >>> >>> Great insight! So far, I haven't thought it. >>> >>>> Therefore, because they're intended to be used together, I'd argue >>>> they shouldn't have different timezones. Also, if they have different >>>> timezones, additions can't have a simple correct answer. What's the >>>> correct answer of 11/20/2014 (PST) + 8:00:00 (UTC)? There's no clear >>>> answer because they can't be simply concatenated. >>>> >>>> (FWIW, the current Tajo code thinks the answer is 11/20/2014 8:00:00 >>>> in UTC. How many of you got it right?) >>> >>> You are definitely right :) If they have different timezone, the >>> problem becomes very complicated. Nobody wants it :) As I mentioned, >>> Timezone problem of Timestamp and Time data types should be addressed >>> in two entry points and client. We need to keep the processing >>> approach simple. >>> >>>> This can cause a lot of confusion to users. When they use a date >>>> alone, it is interpreted as a local time. But, as soon as they add a >>>> time to it, it is silently converted to UTC in a way which is very >>>> unexpected to many users. >>>> >>>> Why am I emphasizing it is unexpected? Look at the comparison below. >>>> >>>> What's the answer to 11/20/2014 (DateDatum) + 8 hours (IntervalDatum)? >>>> It's 11/20/2014 8:00 in a local time (PST on my machine). How about >>>> 11/20/2014 (DateDatum) + 8:00:00 (TimeDatum)? It's 11/20/2014 8:00 in >>>> UTC as I wrote above. How many users would be able to expect this? >>> >>> They are definitely a bug. We follow PostgreSQL in all aspects, and >>> the following results come from the PostgreSQL. The results of two >>> operations are the same. >>> >>> hyunsik=> SELECT date '11/20/2014' + time '08:00'; >>> ?column? >>> --------------------- >>> 2014-11-20 08:00:00 >>> (1 row) >>> >>> hyunsik=> select date '11/20/2014' + interval '8 hrs'; >>> ?column? >>> --------------------- >>> 2014-11-20 08:00:00 >>> (1 row) >>> >>> >>>> >>>> So, coming back to my original question. What timezone should a >>>> TimeDatum have? UTC or local time? It's currently UTC. But, I believe >>>> it should be changed to the local time zone. >>> >>> TimeDatum is UTC. >>> >>> In sum, we should keep both TimeDatum and TimestampDatum UTC values. >>> Then, we should address timezone offsets in two entry points and >>> client side. >>> >>>> >>>> (Sorry for the long email. But, I think it's critical to get this >>>> right and build consensus over it so that we can provide a consistent >>>> behavior going forward. The actual fix will be really simple, though.) >>>> >>>> On Thu, Nov 20, 2014 at 2:44 AM, Hyunsik Choi <[email protected]> >>>> wrote: >>>>> Those parts have poor documentation. >>>>> >>>>> I agree with your investigation. I also could find many misuse of timezone >>>>> in many parts. We should make them clear and fix them in this chance. >>>>> >>>>> I just got off the plane, and I'm still on the road. So, I'll give more >>>>> comments tomorrow. >>>>> >>>>> - hyunsik >>>>> >>>>> On Thu, Nov 20, 2014 at 5:24 PM, Jaewoong Jung <[email protected]> wrote: >>>>> >>>>>> Wow, this seemingly trivial issue has surprisingly many problems >>>>>> involved. >>>>>> >>>>>> The most critical one, though, is TimeMeta class. Presumably because >>>>>> it is poorly documented, it is being used for two different purposes >>>>>> in Tajo code base. Some code treats it as a date time representation, >>>>>> which I believe is the original intention, but some treat it as the >>>>>> human-readable equivalent of TimeDatum by completely ignoring >>>>>> date-related fields. >>>>>> >>>>>> For example, DateTimeUtil.date2j(long julianDate, TimeMeta tm), which >>>>>> converts a julian timestamp to a TimeMeta value, doesn't touch >>>>>> dayOfMonth, monthOfYear, or years values and just puts all values for >>>>>> hour and above units in hours field. >>>>>> >>>>>> There are other minor problems like incorrect comments and absent >>>>>> default values, but the most critical one is misuse of TimeMeta. >>>>>> >>>>>> I'll try to break up my patches so that each has a clear and >>>>>> easy-to-understand goal. Sending this heads-up to let you know >>>>>> there'll be more issues filed and patches sent than you might expect. >>>>>> >>>>>> On Wed, Nov 19, 2014 at 10:56 PM, Jaewoong Jung <[email protected]> wrote: >>>>>> > Yeah, after some more research, I found that TImeDatum is a somewhat >>>>>> > ambiguous data type. >>>>>> > >>>>>> > Its original purpose is to represent a time of a day, i.e. hh:mm:ss >>>>>> > part of an instant. So, it could be viewed as instant data though some >>>>>> > may argue it's incomplete to fully represent an instant. Anyway, given >>>>>> > that TimestampDatum has limitation in terms of the time range it can >>>>>> > represent, and given that Tajo doesn't have DateTime data type, >>>>>> > clients should be allowed to use it with a timezone. >>>>>> > >>>>>> > I'll change the direction and try to address the issue by fixing the >>>>>> > underflow error. >>>>>> > >>>>>> > Thanks for the input. :) >>>>>> > >>>>>> > On Wed, Nov 19, 2014 at 10:24 PM, Jong-young Park <[email protected]> >>>>>> wrote: >>>>>> >> Hi, Jaewoong. >>>>>> >> >>>>>> >> To express time period value, IntervalDatum is existing as I know. >>>>>> >> >>>>>> >> So I think it is right that TimeDatum is for some time point. >>>>>> >> >>>>>> >> And TimestampDatum seems it is doing both roles of DateDatum and >>>>>> TimeDatum. >>>>>> >> >>>>>> >> Regards, >>>>>> >> Jongyoung. >>>>>> >> >>>>>> >> On Wed Nov 19 2014 at 오후 5:23:40 Jaewoong Jung <[email protected]> >>>>>> wrote: >>>>>> >> >>>>>> >>> It turns out, TAJO-1191 is slightly more complicated than I thought. >>>>>> >>> (https://issues.apache.org/jira/browse/TAJO-1191) >>>>>> >>> >>>>>> >>> Basically, it's about whether TimeDatum may have a timezone tied with >>>>>> >>> it. I **believe** TimeDatum is originally designed to hold a time >>>>>> >>> period value, not an instant. (TimestampDatum seems to be the >>>>>> >>> canonical container for instants.) So, it doesn't make sense to apply >>>>>> >>> any timezones to TimeDatum values, but it's being done in a few >>>>>> >>> places. And, that's why the test is failing on my machines. >>>>>> >>> >>>>>> >>> I'm going to try to fix it by removing all timezone-related >>>>>> >>> references >>>>>> >>> around the class, but I want to check my assumption with you before I >>>>>> >>> proceed. >>>>>> >>> >>>>>> >>> What do you think about it? >>>>>> >>> >>>>>>
