I'm happy to work with you and fix a major nuisance :) Later, I'll share the timezone related problem with you when I found additional bugs.
Warm regards, Hyunsik On Tue, Nov 25, 2014 at 4:17 PM, Jaewoong Jung <[email protected]> wrote: > Everything is very clear now thanks to your explanation. :) > > Okay, then I'll fix the issue by making DateDatum timezone-neutral and > TimeDatum UTC-based. Also, I'll play with PostgreSQL to understand its > timezone model better. > > Meanwhile, please feel free to assign timezone-related bugs to me as > you see fit. > > On Mon, Nov 24, 2014 at 5:52 PM, Hyunsik Choi <[email protected]> wrote: >> Thank you all guys for your comments. >> >> Jaewoong, >> >> I leave inline comments. If my answers are not enough for your >> question or I misunderstood your question, please feel free to ask >> additional questions. >> >> Best regards, >> Hyunsik >> >> On Fri, Nov 21, 2014 at 12:50 AM, Jaewoong Jung <[email protected]> wrote: >>> There's another issue that hopefully Hyunsik would be able to clarify, >>> and it's very crucial to handling timezones in these data types. >>> >>> Q: So, let's say (and I agree) TimeDatum represents an instant, so can >>> be timezoned. Then, is it a UTC time or a local time? >>> >> >> TimeDatum is UTC or should be UTC if some parts are not. >> >>> Let me explain why this question is important. >>> >>> DateDatum represents an instant. And, it is implicitly timezoned to >>> the user local time. Which means, if I use '11/20/2014', it's >>> '11/20/2014 0:00:00 PST' and is equivalent to a TimestampDatum for >>> '11/20/2014 8:00:00 UTC'. (BTW, the compareTo implementation for >>> DateDatum is broken in this regard. I'll file a separate issue for >>> that.) >> >> I agree with your proposal. I also think that DateDatum does not need >> to be timezoned. We should keep it as DateDatum instead of >> TimestampDatum. Thank you for nice finding! >> >>> What about TimeDatum? It is currently timezoned to UTC, (surprise, >>> anyone?) if I understood the code correctly. When we add a TimeDatum >>> to a DateDatum, we convert the TimeDatum to the user local time, which >>> implies TimeDatum is UTC-based. (Also, the comment next to the line >>> explicitly mentions it.) >> >> You are right. TimeDatum represents UTC time. >> >> FYI, I'd like to describe additional background. There are only two >> entry points to take time or timestamp values. One is records in input >> tables, and another is SQL statements. Currently, Input table uses the >> system global timezone specified in TajoConf (tajo-site.xml file). >> Later, we will add one table property to allow users to specify >> timezone for each table. For SQL statement (e.g., SELECT time >> '03:00:00'), we will use client timezone. Also, we will provide some >> expression to allow users to specify timezone for time or timestamp in >> SQL statements. >> >> Consequently, only two entry points have to deal with timezone for >> Timestamp and Time. Other parts in Tajo should deal with all values in >> UTC. >> >>> >>> So, here's the problem. >>> >>> Why do they have different timezones? They're incomplete as an instant >>> when used alone and are complementary to each other. This is an >>> important concept. To understand it, you have to think about why >>> adding DateDatum and TimeDatum is allowed in the code. Originally, >>> instants can't be added. (And, that's why I thought TimeDatum is not >>> an instant.) You can't add (say) 11/20/2014 to 11/23/2019. Subtracting >>> an instant from an instant makes sense and results in a period, but >>> they can't be added. >>> >>> However, in the case of DateDatum and TimeDatum, additions are allowed >>> because they're complementary to each other, and what the code does >>> **conceptually** is concatenate the two. >> >> Great insight! So far, I haven't thought it. >> >>> Therefore, because they're intended to be used together, I'd argue >>> they shouldn't have different timezones. Also, if they have different >>> timezones, additions can't have a simple correct answer. What's the >>> correct answer of 11/20/2014 (PST) + 8:00:00 (UTC)? There's no clear >>> answer because they can't be simply concatenated. >>> >>> (FWIW, the current Tajo code thinks the answer is 11/20/2014 8:00:00 >>> in UTC. How many of you got it right?) >> >> You are definitely right :) If they have different timezone, the >> problem becomes very complicated. Nobody wants it :) As I mentioned, >> Timezone problem of Timestamp and Time data types should be addressed >> in two entry points and client. We need to keep the processing >> approach simple. >> >>> This can cause a lot of confusion to users. When they use a date >>> alone, it is interpreted as a local time. But, as soon as they add a >>> time to it, it is silently converted to UTC in a way which is very >>> unexpected to many users. >>> >>> Why am I emphasizing it is unexpected? Look at the comparison below. >>> >>> What's the answer to 11/20/2014 (DateDatum) + 8 hours (IntervalDatum)? >>> It's 11/20/2014 8:00 in a local time (PST on my machine). How about >>> 11/20/2014 (DateDatum) + 8:00:00 (TimeDatum)? It's 11/20/2014 8:00 in >>> UTC as I wrote above. How many users would be able to expect this? >> >> They are definitely a bug. We follow PostgreSQL in all aspects, and >> the following results come from the PostgreSQL. The results of two >> operations are the same. >> >> hyunsik=> SELECT date '11/20/2014' + time '08:00'; >> ?column? >> --------------------- >> 2014-11-20 08:00:00 >> (1 row) >> >> hyunsik=> select date '11/20/2014' + interval '8 hrs'; >> ?column? >> --------------------- >> 2014-11-20 08:00:00 >> (1 row) >> >> >>> >>> So, coming back to my original question. What timezone should a >>> TimeDatum have? UTC or local time? It's currently UTC. But, I believe >>> it should be changed to the local time zone. >> >> TimeDatum is UTC. >> >> In sum, we should keep both TimeDatum and TimestampDatum UTC values. >> Then, we should address timezone offsets in two entry points and >> client side. >> >>> >>> (Sorry for the long email. But, I think it's critical to get this >>> right and build consensus over it so that we can provide a consistent >>> behavior going forward. The actual fix will be really simple, though.) >>> >>> On Thu, Nov 20, 2014 at 2:44 AM, Hyunsik Choi <[email protected]> >>> wrote: >>>> Those parts have poor documentation. >>>> >>>> I agree with your investigation. I also could find many misuse of timezone >>>> in many parts. We should make them clear and fix them in this chance. >>>> >>>> I just got off the plane, and I'm still on the road. So, I'll give more >>>> comments tomorrow. >>>> >>>> - hyunsik >>>> >>>> On Thu, Nov 20, 2014 at 5:24 PM, Jaewoong Jung <[email protected]> wrote: >>>> >>>>> Wow, this seemingly trivial issue has surprisingly many problems involved. >>>>> >>>>> The most critical one, though, is TimeMeta class. Presumably because >>>>> it is poorly documented, it is being used for two different purposes >>>>> in Tajo code base. Some code treats it as a date time representation, >>>>> which I believe is the original intention, but some treat it as the >>>>> human-readable equivalent of TimeDatum by completely ignoring >>>>> date-related fields. >>>>> >>>>> For example, DateTimeUtil.date2j(long julianDate, TimeMeta tm), which >>>>> converts a julian timestamp to a TimeMeta value, doesn't touch >>>>> dayOfMonth, monthOfYear, or years values and just puts all values for >>>>> hour and above units in hours field. >>>>> >>>>> There are other minor problems like incorrect comments and absent >>>>> default values, but the most critical one is misuse of TimeMeta. >>>>> >>>>> I'll try to break up my patches so that each has a clear and >>>>> easy-to-understand goal. Sending this heads-up to let you know >>>>> there'll be more issues filed and patches sent than you might expect. >>>>> >>>>> On Wed, Nov 19, 2014 at 10:56 PM, Jaewoong Jung <[email protected]> wrote: >>>>> > Yeah, after some more research, I found that TImeDatum is a somewhat >>>>> > ambiguous data type. >>>>> > >>>>> > Its original purpose is to represent a time of a day, i.e. hh:mm:ss >>>>> > part of an instant. So, it could be viewed as instant data though some >>>>> > may argue it's incomplete to fully represent an instant. Anyway, given >>>>> > that TimestampDatum has limitation in terms of the time range it can >>>>> > represent, and given that Tajo doesn't have DateTime data type, >>>>> > clients should be allowed to use it with a timezone. >>>>> > >>>>> > I'll change the direction and try to address the issue by fixing the >>>>> > underflow error. >>>>> > >>>>> > Thanks for the input. :) >>>>> > >>>>> > On Wed, Nov 19, 2014 at 10:24 PM, Jong-young Park <[email protected]> >>>>> wrote: >>>>> >> Hi, Jaewoong. >>>>> >> >>>>> >> To express time period value, IntervalDatum is existing as I know. >>>>> >> >>>>> >> So I think it is right that TimeDatum is for some time point. >>>>> >> >>>>> >> And TimestampDatum seems it is doing both roles of DateDatum and >>>>> TimeDatum. >>>>> >> >>>>> >> Regards, >>>>> >> Jongyoung. >>>>> >> >>>>> >> On Wed Nov 19 2014 at 오후 5:23:40 Jaewoong Jung <[email protected]> >>>>> wrote: >>>>> >> >>>>> >>> It turns out, TAJO-1191 is slightly more complicated than I thought. >>>>> >>> (https://issues.apache.org/jira/browse/TAJO-1191) >>>>> >>> >>>>> >>> Basically, it's about whether TimeDatum may have a timezone tied with >>>>> >>> it. I **believe** TimeDatum is originally designed to hold a time >>>>> >>> period value, not an instant. (TimestampDatum seems to be the >>>>> >>> canonical container for instants.) So, it doesn't make sense to apply >>>>> >>> any timezones to TimeDatum values, but it's being done in a few >>>>> >>> places. And, that's why the test is failing on my machines. >>>>> >>> >>>>> >>> I'm going to try to fix it by removing all timezone-related references >>>>> >>> around the class, but I want to check my assumption with you before I >>>>> >>> proceed. >>>>> >>> >>>>> >>> What do you think about it? >>>>> >>> >>>>>
