[ https://issues.apache.org/jira/browse/PHOENIX-5066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684839#comment-17684839 ]

Istvan Toth edited comment on PHOENIX-5066 at 2/7/23 12:38 PM:
---------------------------------------------------------------

I have found some new cases where the timezone to be used is not trivial (some 
of these are discussed above).

Most of these stem from the fact that the only temporal types Phoenix handles 
are nanosecond ({*}TIMESTAMP{*}) and millisecond (everything else) resolution 
*TIMESTAMP WITH TIME ZONE* types (i.e. java.time.Instant), and every other 
type that is supposed to be java.time.LocalDate-like is instead badly faked, 
but is internally still an Instant.
So we cannot actually store '2000-01-01' as a date; we can only store 
'2000-01-01 00:00:00.000 GMT' as an Instant (aka TIMESTAMP WITH TIME ZONE).
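For illustration, a standalone java.time sketch (not Phoenix code) of what that means: the "date" we can store is really just the Instant at midnight GMT.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

public class DateAsInstant {
    public static void main(String[] args) {
        // What we'd like to store: a plain calendar date.
        LocalDate date = LocalDate.parse("2000-01-01");
        // What actually gets stored: the instant at midnight GMT,
        // i.e. a point on the global timeline, not a calendar date.
        Instant stored = date.atStartOfDay(ZoneOffset.UTC).toInstant();
        System.out.println(stored);                // 2000-01-01T00:00:00Z
        System.out.println(stored.toEpochMilli()); // 946684800000
    }
}
```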

1. default/min/max values specified in DDL
{noformat}
create table b1 (id integer primary key, v1 date default '2000-01-01 01:01:01');
create table b1 (id integer primary key, v1 date default '2000-01-01 01:01:01', v2 date default (round(time'2000-01-01 01:01:01', 'DAY')));{noformat}
The default expression is written verbatim to the COLUMN_DEF column.

 

2. view constants specified in DDL
{noformat}
create view v1 as select * from b1 where v1=DATE '2000-01-01 01:01:01';{noformat}
The statement is written verbatim to the VIEW_STATEMENT column.

 

3. functional indexes 
{noformat}
create index i1 on v1 (round(v1, 'MONTH'));{noformat}
Each of these is problematic.

In each case, we actually tokenize the expression into an Expression tree, and 
then re-generate the string representation before writing to SYSTEM.CATALOG.
While tokenizing, we also convert the temporal literals to GMT.

There are two things we could do:

A.) Write the temporal literal string as GMT

B.) Write the temporal literal string in the client TZ

B.) would be a very bad idea, as we'd replicate the same chaos with clients in 
different timezones that we have now.

However, if we use GMT, we need to make sure that we interpret these strings 
according to GMT everywhere in the code, even on the client side (for 
example, when generating functional index updates for immutable tables; I'm not 
sure whether those still use the local code path).

This has a few ramifications:

Temporal default values will not match the local timezone, so a default 
that was intended to be midnight in the local TZ would resolve to some other 
time instead.
There is nothing we can do to fix this without implementing real, SQL-compliant 
temporal types.

The same goes for view constants: every client gets identical results, but 
those won't actually match the timezone they use.
Same issue: unless we implement real SQL types, there is nothing we can do.
This is the least bad case, as in this case data is never written, so there is 
no possibility of messing up the timezone of the written data.

The functional indexes are the most problematic, and we should possibly prevent 
users from using temporal functions in them.
If we didn't convert to GMT, and instead calculated the result according to 
the client TZ, then the same value upserted from different TZs
would generate different index rows, and queries would return arbitrary results.
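A standalone java.time sketch of that failure mode (not Phoenix's actual ROUND implementation; roundToMonth below is a hypothetical stand-in): the same stored instant truncates to different month boundaries depending on the client zone.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.temporal.ChronoUnit;

public class RoundInDifferentZones {
    // Hypothetical stand-in for ROUND(v1, 'MONTH'): truncate the stored
    // instant to the start of its month, evaluated in the given zone.
    static Instant roundToMonth(Instant stored, ZoneId clientZone) {
        return stored.atZone(clientZone)
                     .withDayOfMonth(1)
                     .truncatedTo(ChronoUnit.DAYS)
                     .toInstant();
    }

    public static void main(String[] args) {
        // A value just after midnight GMT on 2000-01-01.
        Instant stored = Instant.parse("2000-01-01T00:30:00Z");
        // A GMT client would index this row under January 2000...
        System.out.println(roundToMonth(stored, ZoneId.of("GMT")));
        // ...while a GMT-8 client sees 1999-12-31 16:30 on its wall clock
        // and would index the very same row under December 1999.
        System.out.println(roundToMonth(stored, ZoneId.of("GMT-8")));
    }
}
```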

If we use GMT for the index calculations, then the results will at least be 
consistent, even if they will be incorrect if the local TZ is not GMT.
Still, running the same query with and without indexes WILL return different 
results, and there is nothing we can do about that.

I also gave some thought to how the standard and other databases avoid these 
problems, and how we could implement the same
(thought experiment, NOT INTENDED to be implemented in this ticket, or ever)

Provided we kept the internal representation, we'd effectively need to do the 
following when reading a temporal from the DB:
1. Take the stored GMT timestamp
2. Convert it into its string representation in GMT, as a date, time, or datetime
3. Parse the result in the client timezone, and keep only the 
date/time/timestamp part

For writing to the DB:
1. Take the local TZ date
2. Convert it to its string representation in that TZ
3. Parse the result in GMT, and keep only the date/time/timestamp part.
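These read/write steps could be sketched with java.time as follows (a thought-experiment sketch only; the method names are mine, not Phoenix API). A LocalDateTime stands in for the intermediate string representation, since both carry exactly the wall-clock fields with no zone.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class LocalFaking {
    static final ZoneId GMT = ZoneOffset.UTC;

    // Read: render the stored GMT instant as a zoneless wall clock in GMT,
    // then re-interpret that wall clock in the client zone. Every client
    // recovers the calendar date/time that was written, regardless of TZ.
    static Instant read(Instant stored, ZoneId clientZone) {
        LocalDateTime wallClock = LocalDateTime.ofInstant(stored, GMT);
        return wallClock.atZone(clientZone).toInstant();
    }

    // Write: take the client-zone wall clock and re-interpret it in GMT,
    // discarding the client offset before storing.
    static Instant write(Instant clientValue, ZoneId clientZone) {
        LocalDateTime wallClock = LocalDateTime.ofInstant(clientValue, clientZone);
        return wallClock.atZone(GMT).toInstant();
    }

    public static void main(String[] args) {
        ZoneId client = ZoneId.of("GMT+8");
        // The client means "2000-01-01 01:01:01" on its own wall clock.
        Instant clientValue = LocalDateTime.parse("2000-01-01T01:01:01")
                                           .atZone(client).toInstant();
        Instant stored = write(clientValue, client);
        System.out.println(stored); // stored as 2000-01-01T01:01:01Z
        // A client in a different zone reads back the same wall clock.
        ZoneId other = ZoneId.of("GMT-5");
        System.out.println(read(stored, other).atZone(other).toLocalDateTime());
    }
}
```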

This would get rid of the extra precision, and would treat the stored values as 
Local* types, instead of Instant.

The other side of the coin is that, according to the SQL standard, ONLY 
TIMESTAMP WITH TIME ZONE stores actual Instants; reading the same
TIME or DATE field will return the same calendar date, but those will resolve 
to very different instants, which can also be bad for some applications.

 


> The TimeZone is incorrectly used during writing or reading data
> ---------------------------------------------------------------
>
>                 Key: PHOENIX-5066
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-5066
>             Project: Phoenix
>          Issue Type: Bug
>    Affects Versions: 5.0.0, 4.14.1
>            Reporter: Jaanai Zhang
>            Assignee: Istvan Toth
>            Priority: Critical
>             Fix For: 5.3.0
>
>         Attachments: DateTest.java, PHOENIX-5066.4x.v1.patch, 
> PHOENIX-5066.4x.v2.patch, PHOENIX-5066.4x.v3.patch, 
> PHOENIX-5066.master.v1.patch, PHOENIX-5066.master.v2.patch, 
> PHOENIX-5066.master.v3.patch, PHOENIX-5066.master.v4.patch, 
> PHOENIX-5066.master.v5.patch, PHOENIX-5066.master.v6.patch
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> We have two methods to write data when using the JDBC API.
> #1. Use the _executeUpdate_ method to execute a string that is an upsert SQL.
> #2. Use the _prepareStatement_ method to set some objects and execute.
> The _string_ data needs to be converted to a new object based on the schema 
> information of the table. We use some date formatters to convert string data 
> to objects for the Date/Time/Timestamp types when writing data, and the same 
> formatters are used when reading data as well.
>  
> *Uses default timezone test*
>  Writing 3 records by the different ways.
> {code:java}
> UPSERT INTO date_test VALUES (1,'2018-12-10 15:40:47','2018-12-10 15:40:47','2018-12-10 15:40:47')
> UPSERT INTO date_test VALUES (2,to_date('2018-12-10 15:40:47'),to_time('2018-12-10 15:40:47'),to_timestamp('2018-12-10 15:40:47'))
> stmt.setInt(1, 3); stmt.setDate(2, date); stmt.setTime(3, time); stmt.setTimestamp(4, ts);
> {code}
> Reading the table by the getObject(getDate/getTime/getTimestamp) methods.
> {code:java}
> 1 | 2018-12-10 | 23:45:07 | 2018-12-10 23:45:07.0 
> 2 | 2018-12-10 | 23:45:07 | 2018-12-10 23:45:07.0 
> 3 | 2018-12-10 | 15:45:07 | 2018-12-10 15:45:07.66 
> {code}
> Reading the table by the getString methods 
> {code:java}
> 1 | 2018-12-10 15:45:07.000 | 2018-12-10 15:45:07.000 | 2018-12-10 15:45:07.000
> 2 | 2018-12-10 15:45:07.000 | 2018-12-10 15:45:07.000 | 2018-12-10 15:45:07.000
> 3 | 2018-12-10 07:45:07.660 | 2018-12-10 07:45:07.660 | 2018-12-10 07:45:07.660
> {code}
>  *Uses GMT+8 test*
>  Writing 3 records by the different ways.
> {code:java}
> UPSERT INTO date_test VALUES (1,'2018-12-10 15:40:47','2018-12-10 15:40:47','2018-12-10 15:40:47')
> UPSERT INTO date_test VALUES (2,to_date('2018-12-10 15:40:47'),to_time('2018-12-10 15:40:47'),to_timestamp('2018-12-10 15:40:47'))
> stmt.setInt(1, 3); stmt.setDate(2, date); stmt.setTime(3, time); stmt.setTimestamp(4, ts);
> {code}
> Reading the table by the getObject(getDate/getTime/getTimestamp) methods.
> {code:java}
> 1 | 2018-12-10 | 23:40:47 | 2018-12-10 23:40:47.0 
> 2 | 2018-12-10 | 15:40:47 | 2018-12-10 15:40:47.0 
> 3 | 2018-12-10 | 15:40:47 | 2018-12-10 15:40:47.106 {code}
> Reading the table by the getString methods
> {code:java}
> 1 | 2018-12-10 23:40:47.000 | 2018-12-10 23:40:47.000 | 2018-12-10 23:40:47.000
> 2 | 2018-12-10 15:40:47.000 | 2018-12-10 15:40:47.000 | 2018-12-10 15:40:47.000
> 3 | 2018-12-10 15:40:47.106 | 2018-12-10 15:40:47.106 | 2018-12-10 15:40:47.106
> {code}
>  
> _We_ have a historical problem: we'll parse the string to 
> Date/Time/Timestamp objects with a timezone in #1, which means the actual data 
> is going to be changed when stored in the HBase table.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
