[
https://issues.apache.org/jira/browse/IMPALA-5942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16842108#comment-16842108
]
Gabor Kaszab commented on IMPALA-5942:
--------------------------------------
I had a short discussion about this topic with [~grahn] the other day, as I ran
into the handling of dateless timestamps in connection with IMPALA-4018. I think
we have to consider two different scenarios here:
1) When a datetime pattern is given without specifying the date.
{code:java}
+--------------------------------------+
| to_timestamp('01:50:00', 'hh:mm:ss') |
+--------------------------------------+
| 01:50:00 |
+--------------------------------------+
{code}
In this case I think Impala should reject the query with an error during
pattern analysis. (Note: by analysis I don't mean query analysis in the
frontend, but rather the parsing of the format in the backend.)
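To make the idea of rejecting a dateless pattern concrete, here is a minimal
Python sketch. It is not Impala's backend code: the function name
check_pattern_has_date is hypothetical, and it assumes SimpleDateFormat-style
tokens where 'y', 'M' and 'd' denote year, month and day. It also ignores quoted
literal text inside patterns.

```python
# Hypothetical sketch, not Impala code: reject a format pattern that
# contains no date tokens before any input value is parsed.
# Assumption: 'y', 'M', 'd' are the year, month and day tokens.
DATE_TOKENS = set("yMd")


def check_pattern_has_date(pattern: str) -> None:
    """Raise ValueError if the pattern has no year, month or day token."""
    if not DATE_TOKENS & set(pattern):
        raise ValueError(
            "format pattern %r does not specify a date" % pattern
        )
```

With this check, a pattern such as 'hh:mm:ss' would be rejected up front, while
'yyyy-MM-dd HH:mm:ss' would pass.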
2) When no datetime pattern is given but the actual input is dateless.
{code:java}
select cast(field_name as timestamp) from table_name;
{code}
{code:java}
insert into table2_with_timestamp_col select string_col_storing_timestamps from
table1;
{code}
Here we can't reject the query with an error, as we have no knowledge of the
data that the query runs on. The options we have here are:
- return null for dateless timestamp values
- default their date part to some hardcoded date (such as the smallest date
Impala's timestamp can hold.)
- default their date part to current date
My least favourite is the third, because we would end up with different results
for the same query depending on when we run it.
Between the first two, I feel that returning null is the cleaner solution, but
this is just my impression and not based on rigorous reasoning.
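The three options can be compared with a small Python sketch. This only
illustrates the policies and is not Impala's implementation: the function
cast_dateless, the policy names and the 'today' parameter are hypothetical, and
1400-01-01 is taken from the valid date range quoted in the Parquet warning of
this issue.

```python
from datetime import date, datetime, time
from typing import Optional

# Lower bound of the valid range quoted in the Parquet warning of this issue.
MIN_DATE = date(1400, 1, 1)


def parse_time_only(text: str) -> Optional[time]:
    """Parse 'HH:MM:SS'; return None if the text does not match."""
    try:
        return datetime.strptime(text, "%H:%M:%S").time()
    except ValueError:
        return None


def cast_dateless(text: str, policy: str,
                  today: Optional[date] = None) -> Optional[datetime]:
    """Apply one of the three hypothetical policies to a dateless value."""
    parsed = parse_time_only(text)
    if parsed is None or policy == "null":
        return None  # option 1: dateless values become NULL
    if policy == "min_date":
        return datetime.combine(MIN_DATE, parsed)  # option 2: fixed date
    if policy == "current_date":
        # option 3: the result depends on the day the query is run
        return datetime.combine(today or date.today(), parsed)
    raise ValueError("unknown policy: %r" % policy)
```

Calling cast_dateless('10:00:00', 'current_date') on two different days gives
two different values, which is the drawback of the third option; the first two
policies always give the same result for the same input.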
According to Greg there are no known users who rely on dateless timestamps as
that is kind of an edge case. So I have one question that bothers me:
Isn't this considered a breaking change? Are we flexible enough to deliver
something like this in a minor release?
> Dateless timestamps (e.g. "10:00:00") are handled inconsistently
> -----------------------------------------------------------------
>
> Key: IMPALA-5942
> URL: https://issues.apache.org/jira/browse/IMPALA-5942
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 2.11.0
> Reporter: Csaba Ringhofer
> Priority: Major
> Labels: timestamp
>
> Impala cannot read back these timestamps from Parquet, while it can read
> them back from textfiles.
> According to the documentation, Impala should be able to handle these values
> somehow, as the examples contain "select cast('08:30:00' as timestamp);"
> see http://impala.apache.org/docs/build/html/topics/impala_timestamp.html
> {code}
> text:
> create table TT1 (t timestamp);
> insert into TT1 (t) values ("10:00:00");
> select * from TT1;
> +----------+
> | t |
> +----------+
> | 10:00:00 |
> +----------+
> parquet:
> create table TT2(t timestamp) STORED AS PARQUET;
> insert into TT2 (t) values ("10:00:00");
> select * from TT2;
> +------+
> | t |
> +------+
> | NULL |
> +------+
> WARNINGS: Parquet file
> 'hdfs://localhost:20500/test-warehouse/tt2/714d741212df3180-cd4e670800000000_226739479_data.0.parq'
> column 't' contains an out of range timestamp. The valid date range is
> 1400-01-01..9999-12-31.
> {code}
> I think that this is a side effect of the fix of IMPALA-4363, but I did not
> check what happens in versions that did not contain this fix.
> UPDATE: I have checked the last commit before the fix of IMPALA-4363, and it
> does not have this bug.