[ https://issues.apache.org/jira/browse/DRILL-4763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15422785#comment-15422785 ]
Vitalii Diravka edited comment on DRILL-4763 at 8/16/16 2:34 PM: ----------------------------------------------------------------- The following logic is used to calculate parquet date in drill now: {code} (Julian_day)*2 = unix_first_day ((4713_BE+1970)*365,26)*2 = 4881176 {code} According to drill doc should use the following logic: {code} (Julian_day +_1970)*365,26 = unix_first_day (4713_BE+1970)*365,26 = 2457615 {code} According to parquet doc should use the following logic which is the right case: {code} unix_first_day = 0 {code} *For example:* Parquet file created from hive: {code} hive> select * from test_parquet; OK 1970-01-05 17:51 Visakh Time taken: 0.046 seconds, Fetched: 1 row(s) {code} {code} vitalii@vitalii-pc:~/parquet-tools/parquet-mr/parquet-tools/target$ java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar cat /tmp/parquetFolder/test_parquet/ dt = 4 tm = 17:51 nm = Visakh {code} {code} Running org.apache.drill.exec.store.parquet.columnreaders.TestDateReader#testParquetDate SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. 1 row(s): ------------------------------------------------------------------------------------------------------------------------------- | dt<DATE(OPTIONAL)> | tm<VARCHAR(OPTIONAL)> | nm<VARCHAR(OPTIONAL)> | ------------------------------------------------------------------------------------------------------------------------------- | -11395-10-22T00:00:00.000Z | 17:51 | Visakh | ------------------------------------------------------------------------------------------------------------------------------- Total record count: 1 {code} Parquet file created from drill: {code} 0: jdbc:drill:zk=local> select * from drill_parquet; +---------------+-----------------+ | current_date | unix_first_day | +---------------+-----------------+ | 2016-08-15 | 1970-01-01 | +---------------+-----------------+ 1 row selected (0.142 seconds) {code} {code} vitalii@vitalii-pc:~/parquet-tools/parquet-mr/parquet-tools/target$ java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar cat /tmp/drill_parquet/ current_date = 4898204 unix_first_day = 4881176 {code} *With fix:* {code} 0: jdbc:drill:zk=local> create table drill_parquet_with_fix as SELECT current_date, CAST('1970-01-05' as date) as unix_fifth_day, CAST('1970-01-01' as date) as unix_first_day FROM (VALUES(1)); +-----------+----------------------------+ | Fragment | Number of records written | +-----------+----------------------------+ | 0_0 | 1 | +-----------+----------------------------+ 1 row selected (0.257 seconds) 0: jdbc:drill:zk=local> select * from drill_parquet_with_fix; +---------------+-----------------+-----------------+ | current_date | unix_fifth_day | unix_first_day | +---------------+-----------------+-----------------+ | 2016-08-15 | 1970-01-05 | 1970-01-01 | +---------------+-----------------+-----------------+ 1 row selected (0.174 seconds) {code} {code} vitalii@vitalii-pc:~/parquet-tools/parquet-mr/parquet-tools/target$ java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar cat /tmp/drill_parquet_with_fix current_date = 17028 unix_fifth_day = 4 unix_first_day = 0 {code} was (Author: vitalii): Such logic is used to calculate parquet date in drill now: {code} (Julian_day)*2 = unix_first_day ((4713_BE+1970)*365,26)*2 = 4881176 {code} Accordingly drill doc must be: {code} (Julian_day +_1970)*365,26 = unix_first_day (4713_BE+1970)*365,26 = 2457615 {code} Accordinly parquet doc must be (right case): {code} unix_first_day = 0 {code} *For example:* Parquet file created from hive: {code} hive> select * from test_parquet; OK 1970-01-05 17:51 Visakh Time taken: 0.046 seconds, Fetched: 1 row(s) {code} {code} vitalii@vitalii-pc:~/parquet-tools/parquet-mr/parquet-tools/target$ java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar cat /tmp/parquetFolder/test_parquet/ dt = 4 tm = 17:51 nm = Visakh {code} {code} Running org.apache.drill.exec.store.parquet.columnreaders.TestDateReader#testParquetDate SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. 1 row(s): ------------------------------------------------------------------------------------------------------------------------------- | dt<DATE(OPTIONAL)> | tm<VARCHAR(OPTIONAL)> | nm<VARCHAR(OPTIONAL)> | ------------------------------------------------------------------------------------------------------------------------------- | -11395-10-22T00:00:00.000Z | 17:51 | Visakh | ------------------------------------------------------------------------------------------------------------------------------- Total record count: 1 {code} Parquet file created from drill: {code} 0: jdbc:drill:zk=local> select * from drill_parquet; +---------------+-----------------+ | current_date | unix_first_day | +---------------+-----------------+ | 2016-08-15 | 1970-01-01 | +---------------+-----------------+ 1 row selected (0.142 seconds) {code} {code} vitalii@vitalii-pc:~/parquet-tools/parquet-mr/parquet-tools/target$ java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar cat /tmp/drill_parquet/ current_date = 4898204 unix_first_day = 4881176 {code} *With fix:* {code} 0: jdbc:drill:zk=local> create table drill_parquet_with_fix as SELECT current_date, CAST('1970-01-05' as date) as unix_fifth_day, CAST('1970-01-01' as date) as unix_first_day FROM (VALUES(1)); +-----------+----------------------------+ | Fragment | Number of records written | +-----------+----------------------------+ | 0_0 | 1 | +-----------+----------------------------+ 1 row selected (0.257 seconds) 0: jdbc:drill:zk=local> select * from drill_parquet_with_fix; +---------------+-----------------+-----------------+ | current_date | unix_fifth_day | unix_first_day | +---------------+-----------------+-----------------+ | 2016-08-15 | 1970-01-05 | 1970-01-01 | +---------------+-----------------+-----------------+ 1 row selected (0.174 seconds) {code} {code} vitalii@vitalii-pc:~/parquet-tools/parquet-mr/parquet-tools/target$ java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar cat /tmp/drill_parquet_with_fix current_date = 17028 unix_fifth_day = 4 unix_first_day = 0 {code} > Parquet file with DATE logical type produces wrong results for simple SELECT > ---------------------------------------------------------------------------- > > Key: DRILL-4763 > URL: https://issues.apache.org/jira/browse/DRILL-4763 > Project: Apache Drill > Issue Type: Bug > Components: Execution - Data Types > Affects Versions: 1.6.0 > Reporter: Paul Rogers > Assignee: Vitalii Diravka > Attachments: date.parquet, int_16.parquet > > > Created a simple Parquet file with the following schema: > message test { required int32 index; required int32 value (DATE); required > int32 raw; } > That is, a file with an int32 storage type and a DATE logical type. Then, > created a number of test values: > 0 (which should be interpreted as 1970-01-01) and > (int) (System.currentTimeMillis() / (24*60*60*1000) ) Which should be > interpreted as the number of days since 1970-01-01 and today. > According to the Parquet spec > (https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md), > Parquet dates are expressed as "the number of days from the Unix epoch, 1 > January 1970." > Java timestamps are expressed as "measured in milliseconds, between the > current time and midnight, January 1, 1970 UTC." > There is ambiguity here: Parquet dates are presumably local times not > absolute times, so the math above will actually tell us the date in London > right now, but that's close enough. > Generate the local file to date.parquet. Query it with: > SELECT * from `local`.`root`.`date.parquet`; > The results are incorrect: > index value raw > 1 -11395-10-18T00:00:00.000-07:52:58 0 > Here, we have a value of 0. The displayed date is decidedly not > 1970-01-01T00:00:00. We actually have many problems: > 1. The date is far off. > 2. The output shows time. But, the Parquet DATE format explcitly does NOT > include time, so it makes no sense to include it. > 3. The output attempts to show a time zone, but a time zone of -07:52:58, > while close to PST, is not right (there is no timezine that is of by 7:02 > from UTC.) > 4. The data has no time zone, Parquet DATE explicilty is a local time, so it > is impossible to know the relationship between that date an UTC. > The correct output (in ISO format) would be: 1970-01-01 > The last line should be today's date, but instead is: > 6 -11348-04-20T00:00:00.000-07:52:58 16986 > Expected: > 2016-07-04 > Note that all the information to produce the right information is available > to Drill: > 1. The DATE annotation says the meaning of the signed 32-bit integer. > 2. Given the starting point and duration in days, the conversion to Drill's > own internal date format is unambiguous. > 3. The DATE annotation says that the date is local, so Drill should not > attempt to convert to UTC. (That is, a Java Date object can't be used, > instead a Joda/Java 8 LocalDate is necessary.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)