[
https://issues.apache.org/jira/browse/DRILL-4763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15364556#comment-15364556
]
Paul Rogers commented on DRILL-4763:
------------------------------------
It seems that Drill's policy for dates is to treat the data as a "bucket of
bits." The user is required to tell Drill that, in this query, for this file,
please treat the data as a date. The user does this using a convert function.
(I have, however, not yet fully tested the conversions to see if they help in
this specific case.)
The specific request here is for Drill to do the conversion automatically for
the reasons cited above. There is only one "right" way to do the conversion
of a date, so Drill might as well do it rather than requiring it in each and
every query or view.
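As a rough illustration of that one "right" conversion (a sketch only, not
Drill's code; the class and method names are made up for this example), a
Parquet DATE value is a day count from 1970-01-01, which Java 8's
java.time.LocalDate can interpret directly:

```java
import java.time.LocalDate;

public class ParquetDateSketch {
    // Hypothetical helper, not part of Drill: interprets a Parquet DATE
    // value (int32 days since 1970-01-01) as a zone-less calendar date.
    static LocalDate fromParquetDate(int daysSinceEpoch) {
        return LocalDate.ofEpochDay(daysSinceEpoch);
    }

    public static void main(String[] args) {
        // The two test values from the issue description.
        System.out.println(fromParquetDate(0));      // 1970-01-01
        System.out.println(fromParquetDate(16986));  // 2016-07-04
    }
}
```

No time of day or time zone is involved at any step, which matches the
expected output given in the issue.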
Note that this is a symptom of a larger problem: Drill does not understand
Parquet logical types. A similar problem occurs with Parquet interval types
(which I have not yet fully tested.)
> Parquet file with DATE logical type produces wrong results for simple SELECT
> ----------------------------------------------------------------------------
>
> Key: DRILL-4763
> URL: https://issues.apache.org/jira/browse/DRILL-4763
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Data Types
> Affects Versions: 1.6.0
> Reporter: Paul Rogers
> Assignee: Vitalii Diravka
> Attachments: date.parquet, int_16.parquet
>
>
> Created a simple Parquet file with the following schema:
> message test { required int32 index; required int32 value (DATE); required
> int32 raw; }
> That is, a file with an int32 storage type and a DATE logical type. Then,
> created a number of test values:
> 0 (which should be interpreted as 1970-01-01) and
> (int) (System.currentTimeMillis() / (24*60*60*1000) ), which should be
> interpreted as the number of days between 1970-01-01 and today.
> According to the Parquet spec
> (https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md),
> Parquet dates are expressed as "the number of days from the Unix epoch, 1
> January 1970."
> Java timestamps are expressed as "measured in milliseconds, between the
> current time and midnight, January 1, 1970 UTC."
> There is ambiguity here: Parquet dates are presumably local times not
> absolute times, so the math above will actually tell us the date in London
> right now, but that's close enough.
> Generate the local file to date.parquet. Query it with:
> SELECT * from `local`.`root`.`date.parquet`;
> The results are incorrect:
> index value raw
> 1 -11395-10-18T00:00:00.000-07:52:58 0
> Here, we have a value of 0. The displayed date is decidedly not
> 1970-01-01T00:00:00. We actually have many problems:
> 1. The date is far off.
> 2. The output shows time. But, the Parquet DATE format explicitly does NOT
> include time, so it makes no sense to include it.
> 3. The output attempts to show a time zone, but a time zone of -07:52:58,
> while close to PST, is not right (there is no time zone that is off by
> 7:52:58 from UTC.)
> 4. The data has no time zone. Parquet DATE explicitly is a local time, so it
> is impossible to know the relationship between that date and UTC.
> The correct output (in ISO format) would be: 1970-01-01
> The last line should be today's date, but instead is:
> 6 -11348-04-20T00:00:00.000-07:52:58 16986
> Expected:
> 2016-07-04
> Note that all the information needed to produce the right result is
> available to Drill:
> 1. The DATE annotation says the meaning of the signed 32-bit integer.
> 2. Given the starting point and duration in days, the conversion to Drill's
> own internal date format is unambiguous.
> 3. The DATE annotation says that the date is local, so Drill should not
> attempt to convert to UTC. (That is, a Java Date object can't be used,
> instead a Joda/Java 8 LocalDate is necessary.)
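Point 3 of the quoted description can be illustrated with a short sketch. It
assumes Java 8's java.time and uses America/Los_Angeles purely as an example
zone; the class and method names are hypothetical and nothing here is taken
from Drill. Treating the day count as a UTC instant and then rendering it in
a zone west of UTC shifts the calendar date, while LocalDate does not:

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;
import java.util.concurrent.TimeUnit;

public class LocalVersusInstantSketch {
    // Treats the day count as an absolute instant (midnight UTC) and then
    // asks what the calendar date is in the given zone.
    static LocalDate viaInstant(int daysSinceEpoch, ZoneId zone) {
        Instant instant = Instant.ofEpochMilli(TimeUnit.DAYS.toMillis(daysSinceEpoch));
        return instant.atZone(zone).toLocalDate();
    }

    // Treats the day count as a local calendar date; no zone is involved.
    static LocalDate viaLocalDate(int daysSinceEpoch) {
        return LocalDate.ofEpochDay(daysSinceEpoch);
    }

    public static void main(String[] args) {
        ZoneId pacific = ZoneId.of("America/Los_Angeles"); // example zone only
        System.out.println(viaInstant(0, pacific));  // 1969-12-31
        System.out.println(viaLocalDate(0));         // 1970-01-01
    }
}
```

The instant-based path lands on the previous day for any zone behind UTC,
which is why a zone-less type is the safer fit for the DATE annotation.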