Thanks @Timo for the detailed reply, let's go on this topic on this discussion, I've merged all mails to this discussion.
> LOCALDATE / LOCALTIME / LOCALTIMESTAMP > > --> uses session time zone, returns DATE/TIME/TIMESTAMP > > CURRENT_DATE/CURRENT_TIME/CURRENT_TIMESTAMP > > --> uses session time zone, returns DATE/TIME/TIMESTAMP > > I'm very sceptical about this behavior. Almost all mature systems (Oracle, > Postgres) and new high quality systems (Presto, Snowflake) use a data type > with some degree of time zone information encoded. In a globalized world with > businesses spanning different regions, I think we should do this as well. > There should be a difference between CURRENT_TIMESTAMP and LOCALTIMESTAMP. > And users should be able to choose which behavior they prefer for their > pipeline. I know that the two series should be different at first glance, but different SQL engines can have their own explanations,for example, CURRENT_TIMESTAMP and LOCALTIMESTAMP are synonyms in Snowflake[1] and has no difference, and Spark only supports the later one and doesn’t support LOCALTIME/LOCALTIMESTAMP[2]. > If we would design this from scatch, I would suggest the following: > > - drop CURRENT_DATE / CURRENT_TIME and let users pick LOCALDATE / LOCALTIME > for materialized timestamp parts The function CURRENT_DATE/CURRENT_TIME is supporting in SQL standard, but LOCALDATE not, I don’t think it’s a good idea that dropping functions which SQL standard supported and introducing a replacement which SQL standard not reminded. > - CURRENT_TIMESTAMP should return a TIMESTAMP WITH TIME ZONE to materialize > all session time information into every record. It it the most generic data > type and allows to cast to all other timestamp data types. This generic > ability can be used for filter predicates as well either through implicit or > explicit casting. TIMESTAMP WITH TIME ZONE indeed contains more information to describe a time point, but the type TIMESTAMP can cast to all other timestamp data types combining with session time zone as well, and it also can be used for filter predicates. For type casting between BIGINT and TIMESTAMP, I think the function way using TO_TIMEMTAMP()/FROM_UNIXTIMESTAMP() is more clear. > PROCTIME/ROWTIME should be time functions based on a long value. Both > System.currentMillis() and our watermark system work on long values. Those > should return TIMESTAMP WITH LOCAL TIME ZONE because the main calculation > should always happen based on UTC. > We discussed it in a different thread, but we should allow PROCTIME globally. > People need a way to create instances of TIMESTAMP WITH LOCAL TIME ZONE. This > is not considered in the current design doc. > Many pipelines contain UTC timestamps and thus it should be easy to create > one. > Also, both CURRENT_TIMESTAMP and LOCALTIMESTAMP can work with this type > because we should remember that TIMESTAMP WITH LOCAL TIME ZONE accepts all > timestamp data types as casting target [1]. We could allow TIMESTAMP WITH > TIME ZONE in the future for ROWTIME. > In any case, windows should simply adapt their behavior to the passed > timestamp type. And with TIMESTAMP WITH LOCAL TIME ZONE a day is defined by > considering the current session time zone. I also agree returning TIMESTAMP WITH LOCAL TIME ZONE for PROCTIME has more clear semantics, but I realized that user didn’t care the type but more about the expressed value they saw, and change the type from TIMESTAMP to TIMESTAMP WITH LOCAL TIME ZONE brings huge refactor that we need consider all places where the TIMESTAMP type used, and many builtin functions and UDFs doest not support TIMESTAMP WITH LOCAL TIME ZONE type. That means both user and Flink devs need to refactor the code(UDF, builtin functions, sql pipeline), to be honest, I didn’t see strong motivation that we have to do the pretty big refactor from user’s perspective and developer’s perspective. In one word, both your suggestion and my proposal can resolve almost all user problems,the divergence is whether we need to spend pretty energy just to get a bit more accurate semantics? I think we need a tradeoff. Best, Leonard [1] https://trino.io/docs/current/functions/datetime.html#current_timestamp <https://trino.io/docs/current/functions/datetime.html#current_timestamp> [2] https://issues.apache.org/jira/browse/SPARK-30374 <https://issues.apache.org/jira/browse/SPARK-30374> > 2021-01-22,00:53,Timo Walther <twal...@apache.org> : > > Hi Leonard, > > thanks for working on this topic. I agree that time handling is not easy in > Flink at the moment. We added new time data types (and some are still not > supported which even further complicates things like TIME(9)). We should > definitely improve this situation for users. > > This is a pretty opinionated topic and it seems that the SQL standard is not > really deciding this but is at least supporting. So let me express my opinion > for the most important functions: > > LOCALDATE / LOCALTIME / LOCALTIMESTAMP > > --> uses session time zone, returns DATE/TIME/TIMESTAMP > > I think those are the most obvious ones because the LOCAL indicates that the > locality should be materialized into the result and any time zone information > (coming from session config or data) is not important afterwards. > > CURRENT_DATE/CURRENT_TIME/CURRENT_TIMESTAMP > > --> uses session time zone, returns DATE/TIME/TIMESTAMP > > I'm very sceptical about this behavior. Almost all mature systems (Oracle, > Postgres) and new high quality systems (Presto, Snowflake) use a data type > with some degree of time zone information encoded. In a globalized world with > businesses spanning different regions, I think we should do this as well. > There should be a difference between CURRENT_TIMESTAMP and LOCALTIMESTAMP. > And users should be able to choose which behavior they prefer for their > pipeline. > > If we would design this from scatch, I would suggest the following: > > - drop CURRENT_DATE / CURRENT_TIME and let users pick LOCALDATE / LOCALTIME > for materialized timestamp parts > > - CURRENT_TIMESTAMP should return a TIMESTAMP WITH TIME ZONE to materialize > all session time information into every record. It it the most generic data > type and allows to cast to all other timestamp data types. This generic > ability can be used for filter predicates as well either through implicit or > explicit casting. > > PROCTIME/ROWTIME should be time functions based on a long value. Both > System.currentMillis() and our watermark system work on long values. Those > should return TIMESTAMP WITH LOCAL TIME ZONE because the main calculation > should always happen based on UTC. We discussed it in a different thread, but > we should allow PROCTIME globally. People need a way to create instances of > TIMESTAMP WITH LOCAL TIME ZONE. This is not considered in the current design > doc. Many pipelines contain UTC timestamps and thus it should be easy to > create one. Also, both CURRENT_TIMESTAMP and LOCALTIMESTAMP can work with > this type because we should remember that TIMESTAMP WITH LOCAL TIME ZONE > accepts all timestamp data types as casting target [1]. We could allow > TIMESTAMP WITH TIME ZONE in the future for ROWTIME. > > In any case, windows should simply adapt their behavior to the passed > timestamp type. And with TIMESTAMP WITH LOCAL TIME ZONE a day is defined by > considering the current session time zone. > > If we would like to design this with less effort required, we could think > about returning TIMESTAMP WITH LOCAL TIME ZONE also for CURRENT_TIMESTAMP. > > > I will try to involve more people into this discussion. > > Thanks, > Timo > > [1] > https://docs.oracle.com/en/database/oracle/oracle-database/21/sqlrf/Data-Types.html#GUID-E7CA339A-2093-4FE4-A36E-1D09593591D3 > > <https://docs.oracle.com/en/database/oracle/oracle-database/21/sqlrf/Data-Types.html#GUID-E7CA339A-2093-4FE4-A36E-1D09593591D3> > 2021-01-21,22:32,Leonard Xu <xbjt...@gmail.com> : >> Before the changes, as I am writing this reply, the local time here is >> 2021-01-21 12:03:35 (Beijing time, UTC+8). >> And I tried these 5 functions in sql client, and got: >> >> Flink SQL> select now(), PROCTIME(), CURRENT_TIMESTAMP, CURRENT_DATE, >> CURRENT_TIME; >> +-------------------------+-------------------------+-------------------------+--------------+--------------+ >> | EXPR$0 | EXPR$1 | >> CURRENT_TIMESTAMP | CURRENT_DATE | CURRENT_TIME | >> +-------------------------+-------------------------+-------------------------+--------------+--------------+ >> | 2021-01-21T04:03:35.228 | 2021-01-21T04:03:35.228 | >> 2021-01-21T04:03:35.228 | 2021-01-21 | 04:03:35.228 | >> +-------------------------+-------------------------+-------------------------+--------------+--------------+ >> After the changes, the expected behavior will change to: >> >> Flink SQL> select now(), PROCTIME(), CURRENT_TIMESTAMP, CURRENT_DATE, >> CURRENT_TIME; >> +-------------------------+-------------------------+-------------------------+--------------+--------------+ >> | EXPR$0 | EXPR$1 | >> CURRENT_TIMESTAMP | CURRENT_DATE | CURRENT_TIME | >> +-------------------------+-------------------------+-------------------------+--------------+--------------+ >> | 2021-01-21T12:03:35.228 | 2021-01-21T12:03:35.228 | >> 2021-01-21T12:03:35.228 | 2021-01-21 | 12:03:35.228 | >> +-------------------------+-------------------------+-------------------------+--------------+--------------+ >> The return type of now(), proctime() and CURRENT_TIMESTAMP still be >> TIMESTAMP; > > To Kurt, thanks for the intuitive case, it really clear, you’re wright that > I want to propose to change the return value of these functions. It’s the > most important part of the topic from user's perspective. > >> I think this definitely deserves a FLIP. > To Jark, nice suggestion, I prepared a FLIP for this topic, and will start > the FLIP discussion soon. > >>> If use the default Flink SQL, the window time range of the >>> statistics is incorrect, then the statistical results will naturally be >>> incorrect. > To zhisheng, sorry to hear that this problem influenced your production jobs, > Could you share your SQL pattern? we can have more inputs and try to > resolve them. > > > Best, > Leonard > 2021-01-21,14:19,Jark Wu <imj...@gmail.com> : > > Great examples to understand the problem and the proposed changes, @Kurt! > > Thanks Leonard for investigating this problem. > The time-zone problems around time functions and windows have bothered a > lot of users. It's time to fix them! > > The return value changes sound reasonable to me, and keeping the return > type unchanged will minimize the surprise to the users. > Besides that, I think it would be better to mention how this affects the > window behaviors, and the interoperability with DataStream. > > I think this definitely deserves a FLIP. > > ==================================================== > > Hi zhisheng, > > Do you have examples to illustrate which case will get the wrong window > boundaries? > That will help to verify whether the proposed changes can solve your > problem. > > Best, > Jark > 2021-01-21,12:54,zhisheng <173855...@qq.com> : > > Thanks to Leonard Xu for discussing this tricky topic. At present, there are > many Flink jobs in our production environment that are used to count > day-level reports (eg: count PV/UV ). > > If use the default Flink SQL, the window time range of the statistics > is incorrect, then the statistical results will naturally be incorrect. > > The user needs to deal with the time zone manually in order to solve the > problem. > > If Flink itself can solve these time zone issues, then I think it will be > user-friendly. > > Thank you > > Best!; > zhisheng > 2021-01-21,12:11,Kurt Young <ykt...@gmail.com> : > > cc this to user & user-zh mailing list because this will affect lots of > users, and also quite a lot of users > were asking questions around this topic. > > Let me try to understand this from user's perspective. > > Your proposal will affect five functions, which are: > PROCTIME() > NOW() > CURRENT_DATE > CURRENT_TIME > CURRENT_TIMESTAMP > Before the changes, as I am writing this reply, the local time here is > 2021-01-21 12:03:35 (Beijing time, UTC+8). > And I tried these 5 functions in sql client, and got: > > Flink SQL> select now(), PROCTIME(), CURRENT_TIMESTAMP, CURRENT_DATE, > CURRENT_TIME; > +-------------------------+-------------------------+-------------------------+--------------+--------------+ > | EXPR$0 | EXPR$1 | CURRENT_TIMESTAMP > | CURRENT_DATE | CURRENT_TIME | > +-------------------------+-------------------------+-------------------------+--------------+--------------+ > | 2021-01-21T04:03:35.228 | 2021-01-21T04:03:35.228 | 2021-01-21T04:03:35.228 > | 2021-01-21 | 04:03:35.228 | > +-------------------------+-------------------------+-------------------------+--------------+--------------+ > After the changes, the expected behavior will change to: > > Flink SQL> select now(), PROCTIME(), CURRENT_TIMESTAMP, CURRENT_DATE, > CURRENT_TIME; > +-------------------------+-------------------------+-------------------------+--------------+--------------+ > | EXPR$0 | EXPR$1 | CURRENT_TIMESTAMP > | CURRENT_DATE | CURRENT_TIME | > +-------------------------+-------------------------+-------------------------+--------------+--------------+ > | 2021-01-21T12:03:35.228 | 2021-01-21T12:03:35.228 | 2021-01-21T12:03:35.228 > | 2021-01-21 | 12:03:35.228 | > +-------------------------+-------------------------+-------------------------+--------------+--------------+ > The return type of now(), proctime() and CURRENT_TIMESTAMP still be TIMESTAMP; > > Best, > Kurt