[ https://issues.apache.org/jira/browse/DRILL-4203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543230#comment-15543230 ]
ASF GitHub Bot commented on DRILL-4203:
---------------------------------------

Github user vdiravka commented on a diff in the pull request:

    https://github.com/apache/drill/pull/595#discussion_r81625440

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java ---
    @@ -935,6 +972,11 @@ public ColumnTypeMetadata_v2 getColumnTypeInfo(String[] name) {
         @JsonIgnore @Override public ParquetTableMetadataBase clone() {
           return new ParquetTableMetadata_v2(files, directories, columnTypeInfo);
         }
    +
    +    @JsonIgnore @Override public boolean isDateCorrect() {
    +      return isDateCorrect;
    --- End diff --

    If a metadata cache file exists, Drill reads it instead of retrieving metadata from the individual Parquet files.
    If the cache file was generated by Drill after this commit, the value of isDateCorrect will be true.
    If it was generated by Drill before this commit, the isDateCorrect field will be absent from the cache file and its value in ParquetTableMetadata_v2 will be false.
    Based on this value we simply determine the DateCorruptionStatus (see ParquetReaderUtility.correctDatesInMetadataCache() for details).
    The rest of the date checking for the cache is unchanged.


> Parquet File : Date is stored wrongly
> -------------------------------------
>
>                 Key: DRILL-4203
>                 URL: https://issues.apache.org/jira/browse/DRILL-4203
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.4.0
>            Reporter: Stéphane Trou
>            Assignee: Vitalii Diravka
>            Priority: Critical
>             Fix For: 1.9.0
>
>
> Hello,
> I have some problems when I try to read Parquet files produced by Drill with
> Spark: all dates are corrupted.
> I think the problem comes from Drill :)
> {code}
> cat /tmp/date_parquet.csv
> Epoch,1970-01-01
> {code}
> {code}
> 0: jdbc:drill:zk=local> select columns[0] as name, cast(columns[1] as date) as epoch_date from dfs.tmp.`date_parquet.csv`;
> +--------+-------------+
> |  name  | epoch_date  |
> +--------+-------------+
> | Epoch  | 1970-01-01  |
> +--------+-------------+
> {code}
> {code}
> 0: jdbc:drill:zk=local> create table dfs.tmp.`buggy_parquet` as select columns[0] as name, cast(columns[1] as date) as epoch_date from dfs.tmp.`date_parquet.csv`;
> +-----------+----------------------------+
> | Fragment  | Number of records written  |
> +-----------+----------------------------+
> | 0_0       | 1                          |
> +-----------+----------------------------+
> {code}
> When I read the file with parquet-tools, I found:
> {code}
> java -jar parquet-tools-1.8.1.jar head /tmp/buggy_parquet/
> name = Epoch
> epoch_date = 4881176
> {code}
> According to
> [https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#date],
> epoch_date should be equal to 0.
> Meta:
> {code}
> java -jar parquet-tools-1.8.1.jar meta /tmp/buggy_parquet/
> file:        file:/tmp/buggy_parquet/0_0_0.parquet
> creator:     parquet-mr version 1.8.1-drill-r0 (build 6b605a4ea05b66e1a6bf843353abcb4834a4ced8)
> extra:       drill.version = 1.4.0
> file schema: root
> --------------------------------------------------------------------------------
> name:        OPTIONAL BINARY O:UTF8 R:0 D:1
> epoch_date:  OPTIONAL INT32 O:DATE R:0 D:1
> row group 1: RC:1 TS:93 OFFSET:4
> --------------------------------------------------------------------------------
> name:         BINARY SNAPPY DO:0 FPO:4 SZ:52/50/0,96 VC:1 ENC:RLE,BIT_PACKED,PLAIN
> epoch_date:   INT32 SNAPPY DO:0 FPO:56 SZ:45/43/0,96 VC:1 ENC:RLE,BIT_PACKED,PLAIN
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
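
A note on the numbers in the report: the stored value 4881176 is exactly 2 × 2440588, and 2440588 is the Julian day number of 1970-01-01. The sketch below illustrates how a reader could undo such a shift once a file is known to be affected. It is not Drill's actual code: the class name `DateShiftSketch`, the methods `unshift` and `decode`, and the boolean `shifted` parameter are all hypothetical, and deciding whether a given file is affected (for example via the isDateCorrect flag discussed in the comment) is deliberately left to the caller.

```java
import java.time.LocalDate;

// Hypothetical illustration only; not taken from the Drill code base.
final class DateShiftSketch {
    // Julian day number of 1970-01-01 (the Unix epoch).
    static final int JULIAN_DAY_OF_UNIX_EPOCH = 2440588;

    // Shift observed in the report: the epoch date was stored as 4881176 instead of 0.
    static final int OBSERVED_SHIFT = 2 * JULIAN_DAY_OF_UNIX_EPOCH;

    // Remove the observed shift from a stored INT32 DATE value.
    static int unshift(int storedDays) {
        return storedDays - OBSERVED_SHIFT;
    }

    // Decode a stored value to a calendar date.
    // 'shifted' is an assumption supplied by the caller: true if the file is
    // known to carry the shift, false if it follows the Parquet DATE definition
    // (days since 1970-01-01).
    static LocalDate decode(int storedDays, boolean shifted) {
        int daysSinceEpoch = shifted ? unshift(storedDays) : storedDays;
        return LocalDate.ofEpochDay(daysSinceEpoch);
    }

    public static void main(String[] args) {
        // Value reported by parquet-tools for the row "Epoch,1970-01-01".
        System.out.println(unshift(4881176));
        System.out.println(decode(4881176, true));
    }
}
```

Passing the shift status in explicitly keeps the arithmetic separate from the detection logic, which is the part the review comment above is actually about.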