subject:"Re\: SQL TIMESTAMP semantics vs. SPARK\-18350"

Re: SQL TIMESTAMP semantics vs. SPARK-18350

2017-06-06 Thread Zoltan Ivanfi

Hi Michael,

To answer this I think we should distinguish between the long-term fix and
the short-term fix.

If understand the replies correctly, everyone agrees that the desired
long-term fix is to have two separate SQL types (TIMESTAMP [WITH|WITHOUT]
TIME ZONE). Because of having separate types, mixing them as you described
can not happen (unless a new feature intentionally allows that). Of course,
conversions are still needed, but there are many examples from different
database systems that we can follow.

Since having two separate types is a huge effort, for a short term solution
I would suggest allowing the single existing TIMESTAMP type to allow both
semantics, configurable per table. The implementation of timezone-agnostic
semantics could be similar to Hive. In Hive, just like in Spark, a
timestamp is UTC-normalized internally but it is shown as a local time when
it gets displayed. To achieve timezone-agnostic behavior, Hive still uses
UTC-based timestamps in memory and adjusts on-disk data to/from this
internal representation if needed. When the on-disk data is UTC-normalized
as well, it matches this internal representation, so the on-disk value
directly corresponds to the UTC instant of the in-memory representation.

When the on-disk data is supposed to have timezone-agnostic semantics, the
on-disk value is made to match the local time value of the in-memory
timestamp, so the value that ultimately gets displayed to the user has
timezone-agnostic semantics (although the corresponding UTC value will be
different depending on the local time zone). So instead of implementing a
separate in-memory representation for timezone-agnostic timestamps, the
desired on-disk semantics are simulated on top of the existing
representation. Timestamps are adjusted during reading/writing as needed.

Implementing this workaround takes a lot less effort and simplifies some
scenarios as well. For example, the situation that you described (union of
two queries returning timestamps of different semantics) does not have to
be handled explicitly, since the in-memory representation are the same,
including their interpretation. Semantics only matter when reading/writing
timestamps from/to disk.

A disadvantage of this workaround is that it is not perfect. In most time
zones, there is an hour skipped by the DST change every year.
Timezone-agnostic timestamps from that single hour can not be emulated this
way, because they are invalid in the local timezone, so there is no UTC
instant that would ultimately get displayed as the desired timestamp. But
that only affects ~0.01% of all timestamps and adapting this workaround
would allow interoperability with 99.99% of timezone-agnostic timestamps
written by Impala and Hive instead of the current situation in which 0% of
these timestamps are interpreted correctly.

Please let me know if some parts of my description were unclear and I will
gladly elaborate on them.

Thanks,

Zoltan

On Fri, Jun 2, 2017 at 9:41 PM Michael Allman  wrote:

> Hi Zoltan,
>
> I don't fully understand your proposal for table-specific timestamp type
> semantics. I think it will be helpful to everyone in this conversation if
> you can identify the expected behavior for a few concrete scenarios.
>
> Suppose we have a Hive metastore table hivelogs with a column named ts
> with the hive timestamp type as described here:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-timestamp.
> This table was created by Hive and is usually accessed through Hive or
> Presto.
>
> Suppose again we have a Hive metastore table sparklogs with a column named
> ts with the Spark SQL timestamp type as described here:
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.TimestampType$.
> This table was created by Spark SQL and is usually accessed through Spark
> SQL.
>
> Let's say Spark SQL sets and reads a table property called
> timestamp_interp to determine timestamp type semantics for that table.
> Consider a dataframe df defined by sql("SELECT sts as ts FROM sparklogs
> UNION ALL SELECT hts as ts FROM hivelogs"). Suppose the timestamp_interp
> table property is absent from hivelogs. For each possible value of
> timestamp_interp set on the table sparklogs,
>
> 1. does df successfully pass analysis (i.e. is it a valid query)?
> 2. if it's a valid dataframe, what is the type of the ts column?
> 3. if it's a valid dataframe, what are the semantics of the type of the ts
> column?
>
> Suppose further that Spark SQL sets the timestamp_interp on hivelogs. Can
> you answer the same three questions for each combination of
> timestamp_interp on hivelogs and sparklogs?
>
> Thank you.
>
> Michael
>
>
> On Jun 2, 2017, at 8:33 AM, Zoltan Ivanfi  wrote:
>
> Hi,
>
> We would like to solve the problem of interoperability of existing data,
> and that is the main use case for having table-level control. Spark should
> be able to read

Re: SQL TIMESTAMP semantics vs. SPARK-18350

2017-06-02 Thread Michael Allman

Hi Zoltan,

I don't fully understand your proposal for table-specific timestamp type 
semantics. I think it will be helpful to everyone in this conversation if you 
can identify the expected behavior for a few concrete scenarios.

Suppose we have a Hive metastore table hivelogs with a column named ts with the 
hive timestamp type as described here: 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-timestamp
 
.
 This table was created by Hive and is usually accessed through Hive or Presto.

Suppose again we have a Hive metastore table sparklogs with a column named ts 
with the Spark SQL timestamp type as described here: 
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.TimestampType$
 
.
 This table was created by Spark SQL and is usually accessed through Spark SQL.

Let's say Spark SQL sets and reads a table property called timestamp_interp to 
determine timestamp type semantics for that table. Consider a dataframe df 
defined by sql("SELECT sts as ts FROM sparklogs UNION ALL SELECT hts as ts FROM 
hivelogs"). Suppose the timestamp_interp table property is absent from 
hivelogs. For each possible value of timestamp_interp set on the table 
sparklogs,

1. does df successfully pass analysis (i.e. is it a valid query)?
2. if it's a valid dataframe, what is the type of the ts column?
3. if it's a valid dataframe, what are the semantics of the type of the ts 
column?

Suppose further that Spark SQL sets the timestamp_interp on hivelogs. Can you 
answer the same three questions for each combination of timestamp_interp on 
hivelogs and sparklogs?

Thank you.

Michael


> On Jun 2, 2017, at 8:33 AM, Zoltan Ivanfi  wrote:
> 
> Hi,
> 
> We would like to solve the problem of interoperability of existing data, and 
> that is the main use case for having table-level control. Spark should be 
> able to read timestamps written by Impala or Hive and at the same time read 
> back its own data. These have different semantics, so having a single flag is 
> not enough.
> 
> Two separate types will solve this problem indeed, but only once every 
> component involved supports them. Unfortunately, adding these separate SQL 
> types is a larger effort that is only feasible in the long term and we would 
> like to provide a short-term solution for interoperability in the meantime.
> 
> Br,
> 
> Zoltan
> 
> On Fri, Jun 2, 2017 at 1:32 AM Reynold Xin  > wrote:
> Yea I don't see why this needs to be per table config. If the user wants to 
> configure it per table, can't they just declare the data type on a per table 
> basis, once we have separate types for timestamp w/ tz and w/o tz? 
> 
> On Thu, Jun 1, 2017 at 4:14 PM, Michael Allman  > wrote:
> I would suggest that making timestamp type behavior configurable and 
> persisted per-table could introduce some real confusion, e.g. in queries 
> involving tables with different timestamp type semantics.
> 
> I suggest starting with the assumption that timestamp type behavior is a 
> per-session flag that can be set in a global `spark-defaults.conf` and 
> consider more granular levels of configuration as people identify solid use 
> cases.
> 
> Cheers,
> 
> Michael
> 
> 
> 
>> On May 30, 2017, at 7:41 AM, Zoltan Ivanfi > > wrote:
>> 
>> Hi,
>> 
>> If I remember correctly, the TIMESTAMP type had UTC-normalized local time 
>> semantics even before Spark 2, so I can understand that Spark considers it 
>> to be the "established" behavior that must not be broken. Unfortunately, 
>> this behavior does not provide interoperability with other SQL engines of 
>> the Hadoop stack.
>> 
>> Let me summarize the findings of this e-mail thread so far:
>> Timezone-agnostic TIMESTAMP semantics would be beneficial for 
>> interoperability and SQL compliance.
>> Spark can not make a breaking change. For backward-compatibility with 
>> existing data, timestamp semantics should be user-configurable on a 
>> per-table level.
>> Before going into the specifics of a possible solution, do we all agree on 
>> these points?
>> 
>> Thanks,
>> 
>> Zoltan
>> 
>> On Sat, May 27, 2017 at 8:57 PM Imran Rashid > > wrote:
>> I had asked zoltan to bring this discussion to the dev list because I think 
>> it's a question that extends beyond a single jira (we can't figure out the 
>> semantics of timestamp in parquet if we don't k ow the overall goal of the 
>> timestamp type) and since its a design question the entire community should 
>> be involved.
>> 
>> I think that a lot of the confusion comes because we're talking about 
>>

Re: SQL TIMESTAMP semantics vs. SPARK-18350

2017-06-02 Thread Zoltan Ivanfi

Hi,

We would like to solve the problem of interoperability of existing data,
and that is the main use case for having table-level control. Spark should
be able to read timestamps written by Impala or Hive and at the same time
read back its own data. These have different semantics, so having a single
flag is not enough.

Two separate types will solve this problem indeed, but only once every
component involved supports them. Unfortunately, adding these separate SQL
types is a larger effort that is only feasible in the long term and we
would like to provide a short-term solution for interoperability in the
meantime.

Br,

Zoltan

On Fri, Jun 2, 2017 at 1:32 AM Reynold Xin  wrote:

> Yea I don't see why this needs to be per table config. If the user wants
> to configure it per table, can't they just declare the data type on a per
> table basis, once we have separate types for timestamp w/ tz and w/o tz?
>
> On Thu, Jun 1, 2017 at 4:14 PM, Michael Allman 
> wrote:
>
>> I would suggest that making timestamp type behavior configurable and
>> persisted per-table could introduce some real confusion, e.g. in queries
>> involving tables with different timestamp type semantics.
>>
>> I suggest starting with the assumption that timestamp type behavior is a
>> per-session flag that can be set in a global `spark-defaults.conf` and
>> consider more granular levels of configuration as people identify solid use
>> cases.
>>
>> Cheers,
>>
>> Michael
>>
>>
>>
>> On May 30, 2017, at 7:41 AM, Zoltan Ivanfi  wrote:
>>
>> Hi,
>>
>> If I remember correctly, the TIMESTAMP type had UTC-normalized local time
>> semantics even before Spark 2, so I can understand that Spark considers it
>> to be the "established" behavior that must not be broken. Unfortunately,
>> this behavior does not provide interoperability with other SQL engines of
>> the Hadoop stack.
>>
>> Let me summarize the findings of this e-mail thread so far:
>>
>>- Timezone-agnostic TIMESTAMP semantics would be beneficial for
>>interoperability and SQL compliance.
>>- Spark can not make a breaking change. For backward-compatibility
>>with existing data, timestamp semantics should be user-configurable on a
>>per-table level.
>>
>> Before going into the specifics of a possible solution, do we all agree
>> on these points?
>>
>> Thanks,
>>
>> Zoltan
>>
>> On Sat, May 27, 2017 at 8:57 PM Imran Rashid 
>> wrote:
>>
>>> I had asked zoltan to bring this discussion to the dev list because I
>>> think it's a question that extends beyond a single jira (we can't figure
>>> out the semantics of timestamp in parquet if we don't k ow the overall goal
>>> of the timestamp type) and since its a design question the entire community
>>> should be involved.
>>>
>>> I think that a lot of the confusion comes because we're talking about
>>> different ways time zone affect behavior: (1) parsing and (2) behavior when
>>> changing time zones for processing data.
>>>
>>> It seems we agree that spark should eventually provide a timestamp type
>>> which does conform to the standard.   The question is, how do we get
>>> there?  Has spark already broken compliance so much that it's impossible to
>>> go back without breaking user behavior?  Or perhaps spark already has
>>> inconsistent behavior / broken compatibility within the 2.x line, so its
>>> not unthinkable to have another breaking change?
>>>
>>> (Another part of the confusion is on me -- I believed the behavior
>>> change was in 2.2, but actually it looks like its in 2.0.1.  That changes
>>> how we think about this in context of what goes into a 2.2
>>> release.  SPARK-18350 isn't the origin of the difference in behavior.)
>>>
>>> First: consider processing data that is already stored in tables, and
>>> then accessing it from machines in different time zones.  The standard is
>>> clear that "timestamp" should be just like "timestamp without time zone":
>>> it does not represent one instant in time, rather it's always displayed the
>>> same, regardless of time zone.  This was the behavior in spark 2.0.0 (and
>>> 1.6),  for hive tables stored as text files, and for spark's json formats.
>>>
>>> Spark 2.0.1  changed the behavior of the json format (I believe
>>> with SPARK-16216), so that it behaves more like timestamp *with* time
>>> zone.  It also makes csv behave the same (timestamp in csv was basically
>>> broken in 2.0.0).  However it did *not* change the behavior of a hive
>>> textfile; it still behaves like "timestamp with*out* time zone".  Here's
>>> some experiments I tried -- there are a bunch of files there for
>>> completeness, but mostly focus on the difference between
>>> query_output_2_0_0.txt vs. query_output_2_0_1.txt
>>>
>>> https://gist.github.com/squito/f348508ca7903ec2e1a64f4233e7aa70
>>>
>>> Given that spark has changed this behavior post 2.0.0, is it still out
>>> of the question to change this behavior to bring it back

Re: SQL TIMESTAMP semantics vs. SPARK-18350

2017-06-01 Thread Reynold Xin

Yea I don't see why this needs to be per table config. If the user wants to
configure it per table, can't they just declare the data type on a per
table basis, once we have separate types for timestamp w/ tz and w/o tz?

On Thu, Jun 1, 2017 at 4:14 PM, Michael Allman  wrote:

> I would suggest that making timestamp type behavior configurable and
> persisted per-table could introduce some real confusion, e.g. in queries
> involving tables with different timestamp type semantics.
>
> I suggest starting with the assumption that timestamp type behavior is a
> per-session flag that can be set in a global `spark-defaults.conf` and
> consider more granular levels of configuration as people identify solid use
> cases.
>
> Cheers,
>
> Michael
>
>
>
> On May 30, 2017, at 7:41 AM, Zoltan Ivanfi  wrote:
>
> Hi,
>
> If I remember correctly, the TIMESTAMP type had UTC-normalized local time
> semantics even before Spark 2, so I can understand that Spark considers it
> to be the "established" behavior that must not be broken. Unfortunately,
> this behavior does not provide interoperability with other SQL engines of
> the Hadoop stack.
>
> Let me summarize the findings of this e-mail thread so far:
>
>- Timezone-agnostic TIMESTAMP semantics would be beneficial for
>interoperability and SQL compliance.
>- Spark can not make a breaking change. For backward-compatibility
>with existing data, timestamp semantics should be user-configurable on a
>per-table level.
>
> Before going into the specifics of a possible solution, do we all agree on
> these points?
>
> Thanks,
>
> Zoltan
>
> On Sat, May 27, 2017 at 8:57 PM Imran Rashid  wrote:
>
>> I had asked zoltan to bring this discussion to the dev list because I
>> think it's a question that extends beyond a single jira (we can't figure
>> out the semantics of timestamp in parquet if we don't k ow the overall goal
>> of the timestamp type) and since its a design question the entire community
>> should be involved.
>>
>> I think that a lot of the confusion comes because we're talking about
>> different ways time zone affect behavior: (1) parsing and (2) behavior when
>> changing time zones for processing data.
>>
>> It seems we agree that spark should eventually provide a timestamp type
>> which does conform to the standard.   The question is, how do we get
>> there?  Has spark already broken compliance so much that it's impossible to
>> go back without breaking user behavior?  Or perhaps spark already has
>> inconsistent behavior / broken compatibility within the 2.x line, so its
>> not unthinkable to have another breaking change?
>>
>> (Another part of the confusion is on me -- I believed the behavior change
>> was in 2.2, but actually it looks like its in 2.0.1.  That changes how we
>> think about this in context of what goes into a 2.2 release.  SPARK-18350
>> isn't the origin of the difference in behavior.)
>>
>> First: consider processing data that is already stored in tables, and
>> then accessing it from machines in different time zones.  The standard is
>> clear that "timestamp" should be just like "timestamp without time zone":
>> it does not represent one instant in time, rather it's always displayed the
>> same, regardless of time zone.  This was the behavior in spark 2.0.0 (and
>> 1.6),  for hive tables stored as text files, and for spark's json formats.
>>
>> Spark 2.0.1  changed the behavior of the json format (I believe
>> with SPARK-16216), so that it behaves more like timestamp *with* time
>> zone.  It also makes csv behave the same (timestamp in csv was basically
>> broken in 2.0.0).  However it did *not* change the behavior of a hive
>> textfile; it still behaves like "timestamp with*out* time zone".  Here's
>> some experiments I tried -- there are a bunch of files there for
>> completeness, but mostly focus on the difference between
>> query_output_2_0_0.txt vs. query_output_2_0_1.txt
>>
>> https://gist.github.com/squito/f348508ca7903ec2e1a64f4233e7aa70
>>
>> Given that spark has changed this behavior post 2.0.0, is it still out of
>> the question to change this behavior to bring it back in line with the sql
>> standard for timestamp (without time zone) in the 2.x line?  Or, as reynold
>> proposes, is the only option at this point to add an off-by-default feature
>> flag to get "timestamp without time zone" semantics?
>>
>>
>> Second, there is the question of parsing strings into timestamp type.
>> I'm far less knowledgeable about this, so I mostly just have questions:
>>
>> * does the standard dictate what the parsing behavior should be for
>> timestamp (without time zone) when a time zone is present?
>>
>> * if it does and spark violates this standard is it worth trying to
>> retain the *other* semantics of timestamp without time zone, even if we
>> violate the parsing part?
>>
>> I did look at what postgres does for comparison:
>>
>>

Re: SQL TIMESTAMP semantics vs. SPARK-18350

2017-06-01 Thread Michael Allman

I would suggest that making timestamp type behavior configurable and persisted 
per-table could introduce some real confusion, e.g. in queries involving tables 
with different timestamp type semantics.

I suggest starting with the assumption that timestamp type behavior is a 
per-session flag that can be set in a global `spark-defaults.conf` and consider 
more granular levels of configuration as people identify solid use cases.

Cheers,

Michael


> On May 30, 2017, at 7:41 AM, Zoltan Ivanfi  wrote:
> 
> Hi,
> 
> If I remember correctly, the TIMESTAMP type had UTC-normalized local time 
> semantics even before Spark 2, so I can understand that Spark considers it to 
> be the "established" behavior that must not be broken. Unfortunately, this 
> behavior does not provide interoperability with other SQL engines of the 
> Hadoop stack.
> 
> Let me summarize the findings of this e-mail thread so far:
> Timezone-agnostic TIMESTAMP semantics would be beneficial for 
> interoperability and SQL compliance.
> Spark can not make a breaking change. For backward-compatibility with 
> existing data, timestamp semantics should be user-configurable on a per-table 
> level.
> Before going into the specifics of a possible solution, do we all agree on 
> these points?
> 
> Thanks,
> 
> Zoltan
> 
> On Sat, May 27, 2017 at 8:57 PM Imran Rashid  > wrote:
> I had asked zoltan to bring this discussion to the dev list because I think 
> it's a question that extends beyond a single jira (we can't figure out the 
> semantics of timestamp in parquet if we don't k ow the overall goal of the 
> timestamp type) and since its a design question the entire community should 
> be involved.
> 
> I think that a lot of the confusion comes because we're talking about 
> different ways time zone affect behavior: (1) parsing and (2) behavior when 
> changing time zones for processing data.
> 
> It seems we agree that spark should eventually provide a timestamp type which 
> does conform to the standard.   The question is, how do we get there?  Has 
> spark already broken compliance so much that it's impossible to go back 
> without breaking user behavior?  Or perhaps spark already has inconsistent 
> behavior / broken compatibility within the 2.x line, so its not unthinkable 
> to have another breaking change?
> 
> (Another part of the confusion is on me -- I believed the behavior change was 
> in 2.2, but actually it looks like its in 2.0.1.  That changes how we think 
> about this in context of what goes into a 2.2 release.  SPARK-18350 isn't the 
> origin of the difference in behavior.)
> 
> First: consider processing data that is already stored in tables, and then 
> accessing it from machines in different time zones.  The standard is clear 
> that "timestamp" should be just like "timestamp without time zone": it does 
> not represent one instant in time, rather it's always displayed the same, 
> regardless of time zone.  This was the behavior in spark 2.0.0 (and 1.6),  
> for hive tables stored as text files, and for spark's json formats.
> 
> Spark 2.0.1  changed the behavior of the json format (I believe with 
> SPARK-16216), so that it behaves more like timestamp *with* time zone.  It 
> also makes csv behave the same (timestamp in csv was basically broken in 
> 2.0.0).  However it did *not* change the behavior of a hive textfile; it 
> still behaves like "timestamp with*out* time zone".  Here's some experiments 
> I tried -- there are a bunch of files there for completeness, but mostly 
> focus on the difference between query_output_2_0_0.txt vs. 
> query_output_2_0_1.txt
> 
> https://gist.github.com/squito/f348508ca7903ec2e1a64f4233e7aa70 
> 
> 
> Given that spark has changed this behavior post 2.0.0, is it still out of the 
> question to change this behavior to bring it back in line with the sql 
> standard for timestamp (without time zone) in the 2.x line?  Or, as reynold 
> proposes, is the only option at this point to add an off-by-default feature 
> flag to get "timestamp without time zone" semantics?
> 
> 
> Second, there is the question of parsing strings into timestamp type.   I'm 
> far less knowledgeable about this, so I mostly just have questions:
> 
> * does the standard dictate what the parsing behavior should be for timestamp 
> (without time zone) when a time zone is present?
> 
> * if it does and spark violates this standard is it worth trying to retain 
> the *other* semantics of timestamp without time zone, even if we violate the 
> parsing part?
> 
> I did look at what postgres does for comparison:
> 
> https://gist.github.com/squito/cb81a1bb07e8f67e9d27eaef44cc522c 
> 
> 
> spark's timestamp certainly does not match postgres's timestamp for parsing, 
> it seems closer to postgres's "timestamp with timezone" -- though

Re: SQL TIMESTAMP semantics vs. SPARK-18350

2017-05-30 Thread Zoltan Ivanfi

Hi,

If I remember correctly, the TIMESTAMP type had UTC-normalized local time
semantics even before Spark 2, so I can understand that Spark considers it
to be the "established" behavior that must not be broken. Unfortunately,
this behavior does not provide interoperability with other SQL engines of
the Hadoop stack.

Let me summarize the findings of this e-mail thread so far:

   - Timezone-agnostic TIMESTAMP semantics would be beneficial for
   interoperability and SQL compliance.
   - Spark can not make a breaking change. For backward-compatibility with
   existing data, timestamp semantics should be user-configurable on a
   per-table level.

Before going into the specifics of a possible solution, do we all agree on
these points?

Thanks,

Zoltan

On Sat, May 27, 2017 at 8:57 PM Imran Rashid  wrote:

> I had asked zoltan to bring this discussion to the dev list because I
> think it's a question that extends beyond a single jira (we can't figure
> out the semantics of timestamp in parquet if we don't k ow the overall goal
> of the timestamp type) and since its a design question the entire community
> should be involved.
>
> I think that a lot of the confusion comes because we're talking about
> different ways time zone affect behavior: (1) parsing and (2) behavior when
> changing time zones for processing data.
>
> It seems we agree that spark should eventually provide a timestamp type
> which does conform to the standard.   The question is, how do we get
> there?  Has spark already broken compliance so much that it's impossible to
> go back without breaking user behavior?  Or perhaps spark already has
> inconsistent behavior / broken compatibility within the 2.x line, so its
> not unthinkable to have another breaking change?
>
> (Another part of the confusion is on me -- I believed the behavior change
> was in 2.2, but actually it looks like its in 2.0.1.  That changes how we
> think about this in context of what goes into a 2.2 release.  SPARK-18350
> isn't the origin of the difference in behavior.)
>
> First: consider processing data that is already stored in tables, and then
> accessing it from machines in different time zones.  The standard is clear
> that "timestamp" should be just like "timestamp without time zone": it does
> not represent one instant in time, rather it's always displayed the same,
> regardless of time zone.  This was the behavior in spark 2.0.0 (and 1.6),
>  for hive tables stored as text files, and for spark's json formats.
>
> Spark 2.0.1  changed the behavior of the json format (I believe
> with SPARK-16216), so that it behaves more like timestamp *with* time
> zone.  It also makes csv behave the same (timestamp in csv was basically
> broken in 2.0.0).  However it did *not* change the behavior of a hive
> textfile; it still behaves like "timestamp with*out* time zone".  Here's
> some experiments I tried -- there are a bunch of files there for
> completeness, but mostly focus on the difference between
> query_output_2_0_0.txt vs. query_output_2_0_1.txt
>
> https://gist.github.com/squito/f348508ca7903ec2e1a64f4233e7aa70
>
> Given that spark has changed this behavior post 2.0.0, is it still out of
> the question to change this behavior to bring it back in line with the sql
> standard for timestamp (without time zone) in the 2.x line?  Or, as reynold
> proposes, is the only option at this point to add an off-by-default feature
> flag to get "timestamp without time zone" semantics?
>
>
> Second, there is the question of parsing strings into timestamp type.
> I'm far less knowledgeable about this, so I mostly just have questions:
>
> * does the standard dictate what the parsing behavior should be for
> timestamp (without time zone) when a time zone is present?
>
> * if it does and spark violates this standard is it worth trying to retain
> the *other* semantics of timestamp without time zone, even if we violate
> the parsing part?
>
> I did look at what postgres does for comparison:
>
> https://gist.github.com/squito/cb81a1bb07e8f67e9d27eaef44cc522c
>
> spark's timestamp certainly does not match postgres's timestamp for
> parsing, it seems closer to postgres's "timestamp with timezone" -- though
> I dunno if that is standard behavior at all.
>
> thanks,
> Imran
>
> On Fri, May 26, 2017 at 1:27 AM, Reynold Xin  wrote:
>
>> That's just my point 4, isn't it?
>>
>>
>> On Fri, May 26, 2017 at 1:07 AM, Ofir Manor 
>> wrote:
>>
>>> Reynold,
>>> my point is that Spark should aim to follow the SQL standard instead of
>>> rolling its own type system.
>>> If I understand correctly, the existing implementation is similar to
>>> TIMESTAMP WITH LOCAL TIMEZONE data type in Oracle..
>>> In addition, there are the standard TIMESTAMP and TIMESTAMP WITH
>>> TIMEZONE data types which are missing from Spark.
>>> So, it is better (for me) if instead of extending the existing types,
>>> Spark would just implement the additional

Re: SQL TIMESTAMP semantics vs. SPARK-18350

2017-05-27 Thread Imran Rashid

I had asked zoltan to bring this discussion to the dev list because I think
it's a question that extends beyond a single jira (we can't figure out the
semantics of timestamp in parquet if we don't k ow the overall goal of the
timestamp type) and since its a design question the entire community should
be involved.

I think that a lot of the confusion comes because we're talking about
different ways time zone affect behavior: (1) parsing and (2) behavior when
changing time zones for processing data.

It seems we agree that spark should eventually provide a timestamp type
which does conform to the standard.   The question is, how do we get
there?  Has spark already broken compliance so much that it's impossible to
go back without breaking user behavior?  Or perhaps spark already has
inconsistent behavior / broken compatibility within the 2.x line, so its
not unthinkable to have another breaking change?

(Another part of the confusion is on me -- I believed the behavior change
was in 2.2, but actually it looks like its in 2.0.1.  That changes how we
think about this in context of what goes into a 2.2 release.  SPARK-18350
isn't the origin of the difference in behavior.)

First: consider processing data that is already stored in tables, and then
accessing it from machines in different time zones.  The standard is clear
that "timestamp" should be just like "timestamp without time zone": it does
not represent one instant in time, rather it's always displayed the same,
regardless of time zone.  This was the behavior in spark 2.0.0 (and 1.6),
 for hive tables stored as text files, and for spark's json formats.

Spark 2.0.1  changed the behavior of the json format (I believe
with SPARK-16216), so that it behaves more like timestamp *with* time
zone.  It also makes csv behave the same (timestamp in csv was basically
broken in 2.0.0).  However it did *not* change the behavior of a hive
textfile; it still behaves like "timestamp with*out* time zone".  Here's
some experiments I tried -- there are a bunch of files there for
completeness, but mostly focus on the difference between
query_output_2_0_0.txt vs. query_output_2_0_1.txt

https://gist.github.com/squito/f348508ca7903ec2e1a64f4233e7aa70

Given that spark has changed this behavior post 2.0.0, is it still out of
the question to change this behavior to bring it back in line with the sql
standard for timestamp (without time zone) in the 2.x line?  Or, as reynold
proposes, is the only option at this point to add an off-by-default feature
flag to get "timestamp without time zone" semantics?

Second, there is the question of parsing strings into timestamp type.   I'm
far less knowledgeable about this, so I mostly just have questions:

* does the standard dictate what the parsing behavior should be for
timestamp (without time zone) when a time zone is present?

* if it does and spark violates this standard is it worth trying to retain
the *other* semantics of timestamp without time zone, even if we violate
the parsing part?

I did look at what postgres does for comparison:

https://gist.github.com/squito/cb81a1bb07e8f67e9d27eaef44cc522c

spark's timestamp certainly does not match postgres's timestamp for
parsing, it seems closer to postgres's "timestamp with timezone" -- though
I dunno if that is standard behavior at all.

thanks,
Imran

On Fri, May 26, 2017 at 1:27 AM, Reynold Xin  wrote:

> That's just my point 4, isn't it?
>
>
> On Fri, May 26, 2017 at 1:07 AM, Ofir Manor  wrote:
>
>> Reynold,
>> my point is that Spark should aim to follow the SQL standard instead of
>> rolling its own type system.
>> If I understand correctly, the existing implementation is similar to
>> TIMESTAMP WITH LOCAL TIMEZONE data type in Oracle..
>> In addition, there are the standard TIMESTAMP and TIMESTAMP WITH TIMEZONE
>> data types which are missing from Spark.
>> So, it is better (for me) if instead of extending the existing types,
>> Spark would just implement the additional well-defined types properly.
>> Just trying to copy-paste CREATE TABLE between SQL engines should not be
>> an exercise of flags and incompatibilities.
>>
>> Regarding the current behaviour, if I remember correctly I had to force
>> our spark O/S user into UTC so Spark wont change my timestamps.
>>
>> Ofir Manor
>>
>> Co-Founder & CTO | Equalum
>>
>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>
>> On Thu, May 25, 2017 at 1:33 PM, Reynold Xin  wrote:
>>
>>> Zoltan,
>>>
>>> Thanks for raising this again, although I'm a bit confused since I've
>>> communicated with you a few times on JIRA and on private emails to explain
>>> that you have some misunderstanding of the timestamp type in Spark and some
>>> of your statements are wrong (e.g. the except text file part). Not sure why
>>> you didn't get any of those.
>>>
>>>
>>> Here's another try:
>>>
>>>
>>> 1. I think you guys misunderstood the semantics of timestamp in Spark
>>> before session

Re: SQL TIMESTAMP semantics vs. SPARK-18350

2017-05-26 Thread Reynold Xin

That's just my point 4, isn't it?


On Fri, May 26, 2017 at 1:07 AM, Ofir Manor  wrote:

> Reynold,
> my point is that Spark should aim to follow the SQL standard instead of
> rolling its own type system.
> If I understand correctly, the existing implementation is similar to
> TIMESTAMP WITH LOCAL TIMEZONE data type in Oracle..
> In addition, there are the standard TIMESTAMP and TIMESTAMP WITH TIMEZONE
> data types which are missing from Spark.
> So, it is better (for me) if instead of extending the existing types,
> Spark would just implement the additional well-defined types properly.
> Just trying to copy-paste CREATE TABLE between SQL engines should not be
> an exercise of flags and incompatibilities.
>
> Regarding the current behaviour, if I remember correctly I had to force
> our spark O/S user into UTC so Spark wont change my timestamps.
>
> Ofir Manor
>
> Co-Founder & CTO | Equalum
>
> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>
> On Thu, May 25, 2017 at 1:33 PM, Reynold Xin  wrote:
>
>> Zoltan,
>>
>> Thanks for raising this again, although I'm a bit confused since I've
>> communicated with you a few times on JIRA and on private emails to explain
>> that you have some misunderstanding of the timestamp type in Spark and some
>> of your statements are wrong (e.g. the except text file part). Not sure why
>> you didn't get any of those.
>>
>>
>> Here's another try:
>>
>>
>> 1. I think you guys misunderstood the semantics of timestamp in Spark
>> before session local timezone change. IIUC, Spark has always assumed
>> timestamps to be with timezone, since it parses timestamps with timezone
>> and does all the datetime conversions with timezone in mind (it doesn't
>> ignore timezone if a timestamp string has timezone specified). The session
>> local timezone change further pushes Spark to that direction, but the
>> semantics has been with timezone before that change. Just run Spark on
>> machines with different timezone and you will know what I'm talking about.
>>
>> 2. CSV/Text is not different. The data type has always been "with
>> timezone". If you put a timezone in the timestamp string, it parses the
>> timezone.
>>
>> 3. We can't change semantics now, because it'd break all existing Spark
>> apps.
>>
>> 4. We can however introduce a new timestamp without timezone type, and
>> have a config flag to specify which one (with tz or without tz) is the
>> default behavior.
>>
>>
>>
>> On Wed, May 24, 2017 at 5:46 PM, Zoltan Ivanfi  wrote:
>>
>>> Hi,
>>>
>>> Sorry if you receive this mail twice, it seems that my first attempt did
>>> not make it to the list for some reason.
>>>
>>> I would like to start a discussion about SPARK-18350
>>>  before it gets
>>> released because it seems to be going in a different direction than what
>>> other SQL engines of the Hadoop stack do.
>>>
>>> ANSI SQL defines the TIMESTAMP type (also known as TIMESTAMP WITHOUT
>>> TIME ZONE) to have timezone-agnostic semantics - basically a type that
>>> expresses readings from calendars and clocks and is unaffected by time
>>> zone. In the Hadoop stack, Impala has always worked like this and recently
>>> Presto also took steps 
>>> to become standards compliant. (Presto's design doc
>>> 
>>> also contains a great summary of the different semantics.) Hive has a
>>> timezone-agnostic TIMESTAMP type as well (except for Parquet, a major
>>> source of incompatibility that is already being addressed
>>> ). A TIMESTAMP in
>>> SparkSQL, however, has UTC-normalized local time semantics (except for
>>> textfile), which is generally the semantics of the TIMESTAMP WITH TIME ZONE
>>> type.
>>>
>>> Given that timezone-agnostic TIMESTAMP semantics provide standards
>>> compliance and consistency with most SQL engines, I was wondering whether
>>> SparkSQL should also consider it in order to become ANSI SQL compliant and
>>> interoperable with other SQL engines of the Hadoop stack. Should SparkSQL
>>> adapt this semantics in the future, SPARK-18350
>>>  may turn out to be
>>> a source of problems. Please correct me if I'm wrong, but this change seems
>>> to explicitly assign TIMESTAMP WITH TIME ZONE semantics to the TIMESTAMP
>>> type. I think SPARK-18350 would be a great feature for a separate TIMESTAMP
>>> WITH TIME ZONE type, but the plain unqualified TIMESTAMP type would be
>>> better becoming timezone-agnostic instead of gaining further timezone-aware
>>> capabilities. (Of course becoming timezone-agnostic would be a behavior
>>> change, so it must be optional and configurable by the user, as in Presto.)
>>>
>>> I would like to hear your opinions about this concern and about
>>> TIMESTAMP

Re: SQL TIMESTAMP semantics vs. SPARK-18350

2017-05-25 Thread Ofir Manor

Reynold,
my point is that Spark should aim to follow the SQL standard instead of
rolling its own type system.
If I understand correctly, the existing implementation is similar to
TIMESTAMP WITH LOCAL TIMEZONE data type in Oracle..
In addition, there are the standard TIMESTAMP and TIMESTAMP WITH TIMEZONE
data types which are missing from Spark.
So, it is better (for me) if instead of extending the existing types, Spark
would just implement the additional well-defined types properly.
Just trying to copy-paste CREATE TABLE between SQL engines should not be an
exercise of flags and incompatibilities.

Regarding the current behaviour, if I remember correctly I had to force our
spark O/S user into UTC so Spark wont change my timestamps.

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Thu, May 25, 2017 at 1:33 PM, Reynold Xin  wrote:

> Zoltan,
>
> Thanks for raising this again, although I'm a bit confused since I've
> communicated with you a few times on JIRA and on private emails to explain
> that you have some misunderstanding of the timestamp type in Spark and some
> of your statements are wrong (e.g. the except text file part). Not sure why
> you didn't get any of those.
>
>
> Here's another try:
>
>
> 1. I think you guys misunderstood the semantics of timestamp in Spark
> before session local timezone change. IIUC, Spark has always assumed
> timestamps to be with timezone, since it parses timestamps with timezone
> and does all the datetime conversions with timezone in mind (it doesn't
> ignore timezone if a timestamp string has timezone specified). The session
> local timezone change further pushes Spark to that direction, but the
> semantics has been with timezone before that change. Just run Spark on
> machines with different timezone and you will know what I'm talking about.
>
> 2. CSV/Text is not different. The data type has always been "with
> timezone". If you put a timezone in the timestamp string, it parses the
> timezone.
>
> 3. We can't change semantics now, because it'd break all existing Spark
> apps.
>
> 4. We can however introduce a new timestamp without timezone type, and
> have a config flag to specify which one (with tz or without tz) is the
> default behavior.
>
>
>
> On Wed, May 24, 2017 at 5:46 PM, Zoltan Ivanfi  wrote:
>
>> Hi,
>>
>> Sorry if you receive this mail twice, it seems that my first attempt did
>> not make it to the list for some reason.
>>
>> I would like to start a discussion about SPARK-18350
>>  before it gets
>> released because it seems to be going in a different direction than what
>> other SQL engines of the Hadoop stack do.
>>
>> ANSI SQL defines the TIMESTAMP type (also known as TIMESTAMP WITHOUT TIME
>> ZONE) to have timezone-agnostic semantics - basically a type that expresses
>> readings from calendars and clocks and is unaffected by time zone. In the
>> Hadoop stack, Impala has always worked like this and recently Presto also
>> took steps  to become
>> standards compliant. (Presto's design doc
>> 
>> also contains a great summary of the different semantics.) Hive has a
>> timezone-agnostic TIMESTAMP type as well (except for Parquet, a major
>> source of incompatibility that is already being addressed
>> ). A TIMESTAMP in
>> SparkSQL, however, has UTC-normalized local time semantics (except for
>> textfile), which is generally the semantics of the TIMESTAMP WITH TIME ZONE
>> type.
>>
>> Given that timezone-agnostic TIMESTAMP semantics provide standards
>> compliance and consistency with most SQL engines, I was wondering whether
>> SparkSQL should also consider it in order to become ANSI SQL compliant and
>> interoperable with other SQL engines of the Hadoop stack. Should SparkSQL
>> adapt this semantics in the future, SPARK-18350
>>  may turn out to be a
>> source of problems. Please correct me if I'm wrong, but this change seems
>> to explicitly assign TIMESTAMP WITH TIME ZONE semantics to the TIMESTAMP
>> type. I think SPARK-18350 would be a great feature for a separate TIMESTAMP
>> WITH TIME ZONE type, but the plain unqualified TIMESTAMP type would be
>> better becoming timezone-agnostic instead of gaining further timezone-aware
>> capabilities. (Of course becoming timezone-agnostic would be a behavior
>> change, so it must be optional and configurable by the user, as in Presto.)
>>
>> I would like to hear your opinions about this concern and about TIMESTAMP
>> semantics in general. Does the community agree that a standards-compliant
>> and interoperable TIMESTAMP type is desired? Do you perceive SPARK-18350 as
>> a potential problem in achieving this or do I misunderstand the effects

Re: SQL TIMESTAMP semantics vs. SPARK-18350

2017-05-25 Thread Zoltan Ivanfi

Hi,

Ofir, thanks for your support. My understanding is that many users have the
same problem as you do.

Reynold, thanks for your reply and sorry for the confusion. My personal
e-mail was specifically about your concerns regarding SPARK-12297 and I
started this separate thread because this is about the general vision
regarding the TIMESTAMP type which may be of interest to the whole
community. My initial e-mail did not address your concerns because I wrote
it before you answered on the other thread.

Regarding your specific concerns:

1. I realize that the TIMESTAMP type in Spark already has UTC-normalized
local time semantics, but I believe that this is problematic for
consistency and interoperability with other SQL engines. In my opinion a
standard-compliant behavior would be the best and since SPARK-18350 takes
SparkSQL even further away from it, I am worried that it makes fixing this
incompatibility even harder.

2. If a timezone is present in a textfile, SparkSQL can parse it indeed.
However, if there is no specific timezone mentioned, it will parse the
TIMESTAMP as a local time, and when the result is displayed to the user
(without the timezone), it will be identical regardless of the current
timezone. This actually matches the way how Hive approximates a
timezone-agnostic TIMESTAMP behavior. Since Hive's in-memory timestamp
representation is UTC-normalized local time (similar to Spark), reading
timestamps in different timezones will result in a different UTC value in
the in-memory representation. However, when they are rendered, they will
look the same, so the apparent behavior will match the desired
timezone-agnostic semantics. (The reason why this is only an approximation
is that timestamps skipped due to DST changes can not be represented this
way.)

But even if we consider textfile to be no exception, it is still not
SQL-compliant that the TIMESTAMP type has TIMESTAMP WITH TIME ZONE
semantics.

3. I agree that Spark must not break compatibility in the interpretation of
already existing data, but I don't think that it means that we can't change
semantics now. It just means that we have to make it configurable, as I
suggested in the initial mail of this thread.

Actually, the requirement of never breaking compatibility is the exact
reason why I'm worried about SPARK-18350, since if people start using that
feature, it will be even harder to change semantics while keeping
compatibility at the same time. (On the other hand, SPARK-18350 would be an
essential feature for a separate TIMESTAMP WITH TIME ZONE type.)

4. The ability to choose the desired behavior of a TIMESTAMP as you suggest
actually solves the problem of breaking compatibility. However, I don't
think that a central configuration flag is enough. Since users who already
have timestamp data may also want to have standard-compliant behavior for
new tables, I think there needs to be a table-specific override for the
global configuration flag. In fact, that is what we wanted to achieve in
SPARK-12297, although our effort was limited to the Parquet format.

Zoltan

On Thu, May 25, 2017 at 12:33 PM Reynold Xin  wrote:

> Zoltan,
>
> Thanks for raising this again, although I'm a bit confused since I've
> communicated with you a few times on JIRA and on private emails to explain
> that you have some misunderstanding of the timestamp type in Spark and some
> of your statements are wrong (e.g. the except text file part). Not sure why
> you didn't get any of those.
>
>
> Here's another try:
>
>
> 1. I think you guys misunderstood the semantics of timestamp in Spark
> before session local timezone change. IIUC, Spark has always assumed
> timestamps to be with timezone, since it parses timestamps with timezone
> and does all the datetime conversions with timezone in mind (it doesn't
> ignore timezone if a timestamp string has timezone specified). The session
> local timezone change further pushes Spark to that direction, but the
> semantics has been with timezone before that change. Just run Spark on
> machines with different timezone and you will know what I'm talking about.
>
> 2. CSV/Text is not different. The data type has always been "with
> timezone". If you put a timezone in the timestamp string, it parses the
> timezone.
>
> 3. We can't change semantics now, because it'd break all existing Spark
> apps.
>
> 4. We can however introduce a new timestamp without timezone type, and
> have a config flag to specify which one (with tz or without tz) is the
> default behavior.
>
>
>
> On Wed, May 24, 2017 at 5:46 PM, Zoltan Ivanfi  wrote:
>
>> Hi,
>>
>> Sorry if you receive this mail twice, it seems that my first attempt did
>> not make it to the list for some reason.
>>
>> I would like to start a discussion about SPARK-18350
>>  before it gets
>> released because it seems to be going in a different direction than what
>> other SQL engines of the Hadoop stack

Re: SQL TIMESTAMP semantics vs. SPARK-18350

2017-05-25 Thread Reynold Xin

Zoltan,

Thanks for raising this again, although I'm a bit confused since I've
communicated with you a few times on JIRA and on private emails to explain
that you have some misunderstanding of the timestamp type in Spark and some
of your statements are wrong (e.g. the except text file part). Not sure why
you didn't get any of those.

Here's another try:

1. I think you guys misunderstood the semantics of timestamp in Spark
before session local timezone change. IIUC, Spark has always assumed
timestamps to be with timezone, since it parses timestamps with timezone
and does all the datetime conversions with timezone in mind (it doesn't
ignore timezone if a timestamp string has timezone specified). The session
local timezone change further pushes Spark to that direction, but the
semantics has been with timezone before that change. Just run Spark on
machines with different timezone and you will know what I'm talking about.

2. CSV/Text is not different. The data type has always been "with
timezone". If you put a timezone in the timestamp string, it parses the
timezone.

3. We can't change semantics now, because it'd break all existing Spark
apps.

4. We can however introduce a new timestamp without timezone type, and have
a config flag to specify which one (with tz or without tz) is the default
behavior.

On Wed, May 24, 2017 at 5:46 PM, Zoltan Ivanfi  wrote:

> Hi,
>
> Sorry if you receive this mail twice, it seems that my first attempt did
> not make it to the list for some reason.
>
> I would like to start a discussion about SPARK-18350
>  before it gets
> released because it seems to be going in a different direction than what
> other SQL engines of the Hadoop stack do.
>
> ANSI SQL defines the TIMESTAMP type (also known as TIMESTAMP WITHOUT TIME
> ZONE) to have timezone-agnostic semantics - basically a type that expresses
> readings from calendars and clocks and is unaffected by time zone. In the
> Hadoop stack, Impala has always worked like this and recently Presto also
> took steps  to become
> standards compliant. (Presto's design doc
> 
> also contains a great summary of the different semantics.) Hive has a
> timezone-agnostic TIMESTAMP type as well (except for Parquet, a major
> source of incompatibility that is already being addressed
> ). A TIMESTAMP in
> SparkSQL, however, has UTC-normalized local time semantics (except for
> textfile), which is generally the semantics of the TIMESTAMP WITH TIME ZONE
> type.
>
> Given that timezone-agnostic TIMESTAMP semantics provide standards
> compliance and consistency with most SQL engines, I was wondering whether
> SparkSQL should also consider it in order to become ANSI SQL compliant and
> interoperable with other SQL engines of the Hadoop stack. Should SparkSQL
> adapt this semantics in the future, SPARK-18350
>  may turn out to be a
> source of problems. Please correct me if I'm wrong, but this change seems
> to explicitly assign TIMESTAMP WITH TIME ZONE semantics to the TIMESTAMP
> type. I think SPARK-18350 would be a great feature for a separate TIMESTAMP
> WITH TIME ZONE type, but the plain unqualified TIMESTAMP type would be
> better becoming timezone-agnostic instead of gaining further timezone-aware
> capabilities. (Of course becoming timezone-agnostic would be a behavior
> change, so it must be optional and configurable by the user, as in Presto.)
>
> I would like to hear your opinions about this concern and about TIMESTAMP
> semantics in general. Does the community agree that a standards-compliant
> and interoperable TIMESTAMP type is desired? Do you perceive SPARK-18350 as
> a potential problem in achieving this or do I misunderstand the effects of
> this change?
>
> Thanks,
>
> Zoltan
>
> ---
>
> List of links in case in-line links do not work:
>
>-
>
>SPARK-18350: https://issues.apache.org/jira/browse/SPARK-18350
>-
>
>Presto's change: https://github.com/prestodb/presto/issues/7122
>-
>
>Presto's design doc: https://docs.google.com/document/d/
>1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit
>
> 
>
>
>

Re: SQL TIMESTAMP semantics vs. SPARK-18350

2017-05-25 Thread Ofir Manor

Hi Zoltan,
thanks for bringing this up, this is really important to me!
Personally, as a user developing app on top of Spark and other tools, the
current timestamp semantics has been a source of some pain - needing to
undo Spark's "auto-correcting" of timestamps .
It would be really great if we could have standard timestamp handling, like
every other SQL-compliant database and processing engine (choosing between
the two main SQL types). I was under the impression that better SQL
compliant was one of the top priorities of the Spark project.
I guess it is pretty lake in the release cycle - but it seems SPARK-18350
was just introduced a couple of weeks ago. Maybe it should be reverted to
unblock the 2.2 release, and a more proper solution could be implemented
for the next release after a more comprehensive discussion?
Just my two cents,

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Wed, May 24, 2017 at 6:46 PM, Zoltan Ivanfi  wrote:

> Hi,
>
> Sorry if you receive this mail twice, it seems that my first attempt did
> not make it to the list for some reason.
>
> I would like to start a discussion about SPARK-18350
>  before it gets
> released because it seems to be going in a different direction than what
> other SQL engines of the Hadoop stack do.
>
> ANSI SQL defines the TIMESTAMP type (also known as TIMESTAMP WITHOUT TIME
> ZONE) to have timezone-agnostic semantics - basically a type that expresses
> readings from calendars and clocks and is unaffected by time zone. In the
> Hadoop stack, Impala has always worked like this and recently Presto also
> took steps  to become
> standards compliant. (Presto's design doc
> 
> also contains a great summary of the different semantics.) Hive has a
> timezone-agnostic TIMESTAMP type as well (except for Parquet, a major
> source of incompatibility that is already being addressed
> ). A TIMESTAMP in
> SparkSQL, however, has UTC-normalized local time semantics (except for
> textfile), which is generally the semantics of the TIMESTAMP WITH TIME ZONE
> type.
>
> Given that timezone-agnostic TIMESTAMP semantics provide standards
> compliance and consistency with most SQL engines, I was wondering whether
> SparkSQL should also consider it in order to become ANSI SQL compliant and
> interoperable with other SQL engines of the Hadoop stack. Should SparkSQL
> adapt this semantics in the future, SPARK-18350
>  may turn out to be a
> source of problems. Please correct me if I'm wrong, but this change seems
> to explicitly assign TIMESTAMP WITH TIME ZONE semantics to the TIMESTAMP
> type. I think SPARK-18350 would be a great feature for a separate TIMESTAMP
> WITH TIME ZONE type, but the plain unqualified TIMESTAMP type would be
> better becoming timezone-agnostic instead of gaining further timezone-aware
> capabilities. (Of course becoming timezone-agnostic would be a behavior
> change, so it must be optional and configurable by the user, as in Presto.)
>
> I would like to hear your opinions about this concern and about TIMESTAMP
> semantics in general. Does the community agree that a standards-compliant
> and interoperable TIMESTAMP type is desired? Do you perceive SPARK-18350 as
> a potential problem in achieving this or do I misunderstand the effects of
> this change?
>
> Thanks,
>
> Zoltan
>
> ---
>
> List of links in case in-line links do not work:
>
>-
>
>SPARK-18350: https://issues.apache.org/jira/browse/SPARK-18350
>-
>
>Presto's change: https://github.com/prestodb/presto/issues/7122
>-
>
>Presto's design doc: https://docs.google.com/document/d/
>1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit
>
> 
>
>
>

Re: SQL TIMESTAMP semantics vs. SPARK-18350

Re: SQL TIMESTAMP semantics vs. SPARK-18350

Re: SQL TIMESTAMP semantics vs. SPARK-18350

Re: SQL TIMESTAMP semantics vs. SPARK-18350

Re: SQL TIMESTAMP semantics vs. SPARK-18350

Re: SQL TIMESTAMP semantics vs. SPARK-18350

Re: SQL TIMESTAMP semantics vs. SPARK-18350

Re: SQL TIMESTAMP semantics vs. SPARK-18350

Re: SQL TIMESTAMP semantics vs. SPARK-18350

Re: SQL TIMESTAMP semantics vs. SPARK-18350

Re: SQL TIMESTAMP semantics vs. SPARK-18350

Re: SQL TIMESTAMP semantics vs. SPARK-18350

12 matches

Site Navigation

Mail list logo

Footer information