Hi Zoltan,

Thanks for bringing this up, this is really important to me!

Personally, as a user developing apps on top of Spark and other tools, I have found the current timestamp semantics to be a source of some pain: I keep needing to undo Spark's "auto-correcting" of timestamps.

It would be really great if we could have standard timestamp handling, like every other SQL-compliant database and processing engine (choosing between the two main SQL types). I was under the impression that better SQL compliance was one of the top priorities of the Spark project.

I guess it is pretty late in the release cycle, but it seems SPARK-18350 was only introduced a couple of weeks ago. Maybe it should be reverted to unblock the 2.2 release, and a more proper solution could be implemented for the next release after a more comprehensive discussion?

Just my two cents,
Ofir Manor
Co-Founder & CTO | Equalum
Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Wed, May 24, 2017 at 6:46 PM, Zoltan Ivanfi <z...@cloudera.com> wrote:

> Hi,
>
> Sorry if you receive this mail twice, it seems that my first attempt did
> not make it to the list for some reason.
>
> I would like to start a discussion about SPARK-18350
> <https://issues.apache.org/jira/browse/SPARK-18350> before it gets
> released because it seems to be going in a different direction than what
> other SQL engines of the Hadoop stack do.
>
> ANSI SQL defines the TIMESTAMP type (also known as TIMESTAMP WITHOUT TIME
> ZONE) to have timezone-agnostic semantics - basically a type that expresses
> readings from calendars and clocks and is unaffected by time zone. In the
> Hadoop stack, Impala has always worked like this and recently Presto also
> took steps <https://github.com/prestodb/presto/issues/7122> to become
> standards compliant. (Presto's design doc
> <https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit>
> also contains a great summary of the different semantics.) Hive has a
> timezone-agnostic TIMESTAMP type as well (except for Parquet, a major
> source of incompatibility that is already being addressed
> <https://issues.apache.org/jira/browse/HIVE-12767>). A TIMESTAMP in
> SparkSQL, however, has UTC-normalized local time semantics (except for
> textfile), which is generally the semantics of the TIMESTAMP WITH TIME ZONE
> type.
>
> Given that timezone-agnostic TIMESTAMP semantics provide standards
> compliance and consistency with most SQL engines, I was wondering whether
> SparkSQL should also consider it in order to become ANSI SQL compliant and
> interoperable with other SQL engines of the Hadoop stack. Should SparkSQL
> adopt these semantics in the future, SPARK-18350
> <https://issues.apache.org/jira/browse/SPARK-18350> may turn out to be a
> source of problems. Please correct me if I'm wrong, but this change seems
> to explicitly assign TIMESTAMP WITH TIME ZONE semantics to the TIMESTAMP
> type. I think SPARK-18350 would be a great feature for a separate TIMESTAMP
> WITH TIME ZONE type, but the plain unqualified TIMESTAMP type would be
> better becoming timezone-agnostic instead of gaining further timezone-aware
> capabilities. (Of course becoming timezone-agnostic would be a behavior
> change, so it must be optional and configurable by the user, as in Presto.)
>
> I would like to hear your opinions about this concern and about TIMESTAMP
> semantics in general. Does the community agree that a standards-compliant
> and interoperable TIMESTAMP type is desired? Do you perceive SPARK-18350 as
> a potential problem in achieving this or do I misunderstand the effects of
> this change?
>
> Thanks,
>
> Zoltan
>
> ---
>
> List of links in case in-line links do not work:
>
> - SPARK-18350: https://issues.apache.org/jira/browse/SPARK-18350
> - Presto's change: https://github.com/prestodb/presto/issues/7122
> - Presto's design doc:
>   https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit
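
To make the two semantics discussed in this thread concrete, here is a minimal sketch in plain Python (standard library only; this is not Spark code and does not use any Spark API). The writer and reader time zones, the sample wall-clock value, and all function names are assumptions chosen for illustration. It models a timezone-agnostic TIMESTAMP as a value that is read back exactly as written, and a UTC-normalized TIMESTAMP as an instant that is stored in UTC and re-rendered in the reader's session time zone.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical session time zones, modeled as fixed UTC offsets for simplicity.
WRITER_TZ = timezone(timedelta(hours=2))   # a writer whose session is at UTC+2
READER_TZ = timezone(timedelta(hours=-7))  # a reader whose session is at UTC-7

# The calendar-and-clock reading the writer inserts (illustrative value).
wall_clock = datetime(2017, 5, 24, 18, 46, 0)


def read_timezone_agnostic(stored: datetime) -> datetime:
    """TIMESTAMP WITHOUT TIME ZONE model: the reading comes back unchanged."""
    return stored


def store_utc_normalized(reading: datetime, session_tz: timezone) -> datetime:
    """Interpret the reading in the writer's zone and keep the UTC instant."""
    return reading.replace(tzinfo=session_tz).astimezone(timezone.utc)


def read_utc_normalized(stored_utc: datetime, session_tz: timezone) -> datetime:
    """Render the stored instant as a wall-clock reading in the reader's zone."""
    return stored_utc.astimezone(session_tz).replace(tzinfo=None)


agnostic = read_timezone_agnostic(wall_clock)
normalized = read_utc_normalized(
    store_utc_normalized(wall_clock, WRITER_TZ), READER_TZ
)

print("timezone-agnostic:", agnostic)    # 2017-05-24 18:46:00
print("UTC-normalized:   ", normalized)  # 2017-05-24 09:46:00
```

Under these assumptions, the timezone-agnostic value reads the same for every user, while the UTC-normalized value shifts by the nine-hour difference between the two sessions. That shift is the kind of "auto-correcting" that applications then have to undo when they expect plain calendar-and-clock readings.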