Hi Zoltan,

I don't fully understand your proposal for table-specific timestamp type semantics. I think it will help everyone in this conversation if you can identify the expected behavior for a few concrete scenarios.

Suppose we have a Hive metastore table hivelogs with a column named hts of the Hive timestamp type, as described here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-timestamp. This table was created by Hive and is usually accessed through Hive or Presto.

Suppose also that we have a Hive metastore table sparklogs with a column named sts of the Spark SQL timestamp type, as described here: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.TimestampType$. This table was created by Spark SQL and is usually accessed through Spark SQL.

Let's say Spark SQL sets and reads a table property called timestamp_interp to determine the timestamp type semantics for that table. Consider a dataframe df defined by sql("SELECT sts AS ts FROM sparklogs UNION ALL SELECT hts AS ts FROM hivelogs").
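For concreteness, here is roughly the setup I have in mind (a sketch only; the timestamp_interp property and the value 'utc_normalized' are placeholders for whatever your proposal defines, not an existing Spark feature, and the DDL assumes a Hive-enabled SparkSession in spark-shell):

    // Sketch of the scenario; property name and value are illustrative.
    spark.sql("""CREATE TABLE sparklogs (sts TIMESTAMP)
                 TBLPROPERTIES ('timestamp_interp' = 'utc_normalized')""")
    spark.sql("CREATE TABLE hivelogs (hts TIMESTAMP)")  // created by Hive; no property
    val df = spark.sql(
      "SELECT sts AS ts FROM sparklogs UNION ALL SELECT hts AS ts FROM hivelogs")
    df.printSchema()  // what type -- and what semantics -- does ts get?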
Suppose the timestamp_interp table property is absent from hivelogs. For each possible value of timestamp_interp set on the table sparklogs:

1. Does df successfully pass analysis (i.e. is it a valid query)?
2. If it's a valid dataframe, what is the type of the ts column?
3. If it's a valid dataframe, what are the semantics of the type of the ts column?

Suppose further that Spark SQL sets timestamp_interp on hivelogs. Can you answer the same three questions for each combination of timestamp_interp on hivelogs and sparklogs?

Thank you.

Michael

> On Jun 2, 2017, at 8:33 AM, Zoltan Ivanfi <z...@cloudera.com> wrote:
>
> Hi,
>
> We would like to solve the problem of interoperability of existing data, and that is the main use case for having table-level control. Spark should be able to read timestamps written by Impala or Hive and at the same time read back its own data. These have different semantics, so having a single flag is not enough.
>
> Two separate types will indeed solve this problem, but only once every component involved supports them. Unfortunately, adding these separate SQL types is a larger effort that is only feasible in the long term, and we would like to provide a short-term solution for interoperability in the meantime.
>
> Br,
>
> Zoltan
>
> On Fri, Jun 2, 2017 at 1:32 AM Reynold Xin <r...@databricks.com> wrote:
> Yea I don't see why this needs to be a per-table config. If the user wants to configure it per table, can't they just declare the data type on a per-table basis, once we have separate types for timestamp w/ tz and w/o tz?
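If I read Reynold's suggestion right, it would presumably look something like this once the separate types exist. The syntax below is hypothetical; Spark currently has no TIMESTAMP WITH TIME ZONE type, so this does not run on any current version:

    // Hypothetical future DDL -- not valid in Spark today
    spark.sql("""CREATE TABLE logs (
      ts_wall TIMESTAMP,                  -- without time zone: a wall-clock reading
      ts_inst TIMESTAMP WITH TIME ZONE    -- with time zone: an instant
    )""")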
>
> On Thu, Jun 1, 2017 at 4:14 PM, Michael Allman <mich...@videoamp.com> wrote:
> I would suggest that making timestamp type behavior configurable and persisted per-table could introduce some real confusion, e.g. in queries involving tables with different timestamp type semantics.
>
> I suggest starting with the assumption that timestamp type behavior is a per-session flag that can be set in a global `spark-defaults.conf`, and considering more granular levels of configuration as people identify solid use cases.
>
> Cheers,
>
> Michael
>
>> On May 30, 2017, at 7:41 AM, Zoltan Ivanfi <z...@cloudera.com> wrote:
>>
>> Hi,
>>
>> If I remember correctly, the TIMESTAMP type had UTC-normalized local time semantics even before Spark 2, so I can understand that Spark considers it to be the "established" behavior that must not be broken. Unfortunately, this behavior does not provide interoperability with other SQL engines of the Hadoop stack.
>>
>> Let me summarize the findings of this e-mail thread so far:
>>
>> - Timezone-agnostic TIMESTAMP semantics would be beneficial for interoperability and SQL compliance.
>> - Spark cannot make a breaking change. For backward compatibility with existing data, timestamp semantics should be user-configurable on a per-table level.
>>
>> Before going into the specifics of a possible solution, do we all agree on these points?
>>
>> Thanks,
>>
>> Zoltan
>>
>> On Sat, May 27, 2017 at 8:57 PM Imran Rashid <iras...@cloudera.com> wrote:
>> I had asked Zoltan to bring this discussion to the dev list because I think it's a question that extends beyond a single jira (we can't figure out the semantics of timestamp in parquet if we don't know the overall goal of the timestamp type), and since it's a design question the entire community should be involved.
>>
>> I think that a lot of the confusion comes because we're talking about different ways time zones affect behavior: (1) parsing and (2) behavior when changing time zones for processing data.
>>
>> It seems we agree that spark should eventually provide a timestamp type which does conform to the standard. The question is, how do we get there? Has spark already broken compliance so much that it's impossible to go back without breaking user behavior? Or perhaps spark already has inconsistent behavior / broken compatibility within the 2.x line, so it's not unthinkable to have another breaking change?
>>
>> (Another part of the confusion is on me -- I believed the behavior change was in 2.2, but actually it looks like it's in 2.0.1. That changes how we think about this in the context of what goes into a 2.2 release. SPARK-18350 isn't the origin of the difference in behavior.)
>>
>> First: consider processing data that is already stored in tables, and then accessing it from machines in different time zones. The standard is clear that "timestamp" should be just like "timestamp without time zone": it does not represent one instant in time; rather, it's always displayed the same, regardless of time zone. This was the behavior in spark 2.0.0 (and 1.6), for hive tables stored as text files, and for spark's json formats.
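To make the two semantics concrete, here is a minimal java.time sketch, independent of any SQL engine:

    import java.time.{Instant, LocalDateTime, ZoneId}

    // "with time zone" semantics: one instant, rendered differently per zone
    val instant = Instant.parse("2017-06-01T07:00:00Z")
    instant.atZone(ZoneId.of("UTC")).toLocalDateTime                  // 2017-06-01T07:00
    instant.atZone(ZoneId.of("America/Los_Angeles")).toLocalDateTime  // 2017-06-01T00:00

    // "without time zone" semantics: a wall-clock reading that displays the
    // same everywhere, no matter where it is read back
    val wallClock = LocalDateTime.parse("2017-06-01T07:00:00")        // 2017-06-01T07:00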
>> Spark 2.0.1 changed the behavior of the json format (I believe with SPARK-16216), so that it behaves more like timestamp *with* time zone. It also makes csv behave the same (timestamp in csv was basically broken in 2.0.0). However, it did *not* change the behavior of a hive textfile; it still behaves like "timestamp with*out* time zone". Here are some experiments I tried -- there are a bunch of files there for completeness, but mostly focus on the difference between query_output_2_0_0.txt vs. query_output_2_0_1.txt:
>>
>> https://gist.github.com/squito/f348508ca7903ec2e1a64f4233e7aa70
>>
>> Given that spark has changed this behavior post 2.0.0, is it still out of the question to change this behavior to bring it back in line with the sql standard for timestamp (without time zone) in the 2.x line? Or, as Reynold proposes, is the only option at this point to add an off-by-default feature flag to get "timestamp without time zone" semantics?
>>
>> Second, there is the question of parsing strings into the timestamp type. I'm far less knowledgeable about this, so I mostly just have questions:
>>
>> * does the standard dictate what the parsing behavior should be for timestamp (without time zone) when a time zone is present?
>>
>> * if it does, and spark violates this standard, is it worth trying to retain the *other* semantics of timestamp without time zone, even if we violate the parsing part?
>>
>> I did look at what postgres does for comparison:
>>
>> https://gist.github.com/squito/cb81a1bb07e8f67e9d27eaef44cc522c
>>
>> Spark's timestamp certainly does not match postgres's timestamp for parsing; it seems closer to postgres's "timestamp with timezone" -- though I dunno if that is standard behavior at all.
>>
>> thanks,
>> Imran
>>
>> On Fri, May 26, 2017 at 1:27 AM, Reynold Xin <r...@databricks.com> wrote:
>> That's just my point 4, isn't it?
>>
>> On Fri, May 26, 2017 at 1:07 AM, Ofir Manor <ofir.ma...@equalum.io> wrote:
>> Reynold,
>> My point is that Spark should aim to follow the SQL standard instead of rolling its own type system. If I understand correctly, the existing implementation is similar to the TIMESTAMP WITH LOCAL TIME ZONE data type in Oracle. In addition, there are the standard TIMESTAMP and TIMESTAMP WITH TIME ZONE data types, which are missing from Spark. So, it would be better (for me) if, instead of extending the existing types, Spark just implemented the additional well-defined types properly. Just trying to copy-paste CREATE TABLE between SQL engines should not be an exercise of flags and incompatibilities.
>>
>> Regarding the current behaviour: if I remember correctly, I had to force our Spark O/S user into UTC so Spark won't change my timestamps.
>>
>> Ofir Manor
>> Co-Founder & CTO | Equalum
>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>
>> On Thu, May 25, 2017 at 1:33 PM, Reynold Xin <r...@databricks.com> wrote:
>> Zoltan,
>>
>> Thanks for raising this again, although I'm a bit confused, since I've communicated with you a few times on JIRA and in private emails to explain that you have some misunderstanding of the timestamp type in Spark and that some of your statements are wrong (e.g. the "except text file" part). Not sure why you didn't get any of those.
>>
>> Here's another try:
>>
>> 1. I think you guys misunderstood the semantics of timestamp in Spark before the session local timezone change. IIUC, Spark has always assumed timestamps to be with timezone, since it parses timestamps with timezone and does all the datetime conversions with timezone in mind (it doesn't ignore the timezone if a timestamp string has one specified). The session local timezone change pushes Spark further in that direction, but the semantics were with timezone before that change. Just run Spark on machines with different timezones and you will know what I'm talking about.
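For anyone who wants to try Reynold's experiment without two machines, here is a spark-shell sketch; the output comments assume Spark 2.x behavior, where a timestamp is stored as an instant and rendered in the JVM default time zone:

    import java.util.TimeZone
    // the in-string offset is parsed, yielding one fixed instant
    val df = spark.sql("SELECT CAST('2017-06-01 00:00:00-07:00' AS TIMESTAMP) AS ts")
    TimeZone.setDefault(TimeZone.getTimeZone("UTC"))
    df.show()  // 2017-06-01 07:00:00
    TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
    df.show()  // 2017-06-01 00:00:00 -- same instant, different rendering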
>>
>> 2. CSV/Text is not different. The data type has always been "with timezone". If you put a timezone in the timestamp string, it parses the timezone.
>>
>> 3. We can't change the semantics now, because it'd break all existing Spark apps.
>>
>> 4. We can, however, introduce a new timestamp without timezone type, and have a config flag to specify which one (with tz or without tz) is the default behavior.
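Concretely, I read point 4 as something like the following. Both the flag name and the type name below are invented here purely for illustration; neither exists in Spark today:

    // Hypothetical -- no such flag or second timestamp type exists yet
    spark.conf.set("spark.sql.defaultTimestampType", "TIMESTAMP_WITHOUT_TIME_ZONE")
    spark.sql("CREATE TABLE t (ts TIMESTAMP)")  // plain TIMESTAMP would then
                                                // resolve to the configured default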
>>
>> On Wed, May 24, 2017 at 5:46 PM, Zoltan Ivanfi <z...@cloudera.com> wrote:
>> Hi,
>>
>> Sorry if you receive this mail twice; it seems that my first attempt did not make it to the list for some reason.
>>
>> I would like to start a discussion about SPARK-18350 <https://issues.apache.org/jira/browse/SPARK-18350> before it gets released, because it seems to be going in a different direction than what the other SQL engines of the Hadoop stack do.
>>
>> ANSI SQL defines the TIMESTAMP type (also known as TIMESTAMP WITHOUT TIME ZONE) to have timezone-agnostic semantics - basically a type that expresses readings from calendars and clocks and is unaffected by time zone. In the Hadoop stack, Impala has always worked like this, and recently Presto also took steps <https://github.com/prestodb/presto/issues/7122> to become standards compliant. (Presto's design doc <https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit> also contains a great summary of the different semantics.) Hive has a timezone-agnostic TIMESTAMP type as well (except for Parquet, a major source of incompatibility that is already being addressed <https://issues.apache.org/jira/browse/HIVE-12767>). A TIMESTAMP in SparkSQL, however, has UTC-normalized local time semantics (except for textfile), which is generally the semantics of the TIMESTAMP WITH TIME ZONE type.
>>
>> Given that timezone-agnostic TIMESTAMP semantics provide standards compliance and consistency with most SQL engines, I was wondering whether SparkSQL should also consider them in order to become ANSI SQL compliant and interoperable with other SQL engines of the Hadoop stack. Should SparkSQL adopt these semantics in the future, SPARK-18350 <https://issues.apache.org/jira/browse/SPARK-18350> may turn out to be a source of problems. Please correct me if I'm wrong, but this change seems to explicitly assign TIMESTAMP WITH TIME ZONE semantics to the TIMESTAMP type. I think SPARK-18350 would be a great feature for a separate TIMESTAMP WITH TIME ZONE type, but the plain unqualified TIMESTAMP type would be better off becoming timezone-agnostic instead of gaining further timezone-aware capabilities. (Of course, becoming timezone-agnostic would be a behavior change, so it must be optional and configurable by the user, as in Presto.)
>>
>> I would like to hear your opinions about this concern and about TIMESTAMP semantics in general. Does the community agree that a standards-compliant and interoperable TIMESTAMP type is desired? Do you perceive SPARK-18350 as a potential problem in achieving this, or do I misunderstand the effects of this change?
>>
>> Thanks,
>>
>> Zoltan
>>
>> ---
>>
>> List of links in case in-line links do not work:
>>
>> SPARK-18350: https://issues.apache.org/jira/browse/SPARK-18350
>> Presto's change: https://github.com/prestodb/presto/issues/7122
>> Presto's design doc: https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit