[jira] [Comment Edited] (DRILL-4203) Parquet File : Date is stored wrongly

Jason Altekruse (JIRA) Mon, 25 Jan 2016 16:28:58 -0800

    [ https://issues.apache.org/jira/browse/DRILL-4203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15116281#comment-15116281 ]


Jason Altekruse edited comment on DRILL-4203 at 1/26/16 12:28 AM:
------------------------------------------------------------------

[~zfong] That is correct. The only extra complexity is that I have added an 
option that lets users turn off auto-correction for any files that are not 
certain to have been created by Drill.

By default we check the file-level created-by metadata. If it shows that the 
file was written by a version of Drill after the fix, no correction will 
happen, regardless of the setting of the option. Similarly, if a file has a 
Drill version string indicating that the data was written before this fix, we 
will always correct the data, regardless of the setting of the option.

The only complicated case is where there is not enough metadata to determine 
whether it is a Drill file. In this case we will check the values in the file: 
either against the file-level min/max statistics when the reader is 
initialized or, when the file lacks min/max statistics (it's a pre-1.0 Drill 
file), by deferring detection until we actually read individual data pages. 
Checks at both of these levels can be disabled by the option.
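
The three cases above can be sketched as a small decision function. This is 
only an illustration, not Drill's actual code: the class, enum and method 
names are hypothetical, and so are the cutoff (dates on or after 10000-01-01 
count as suspicious) and the choice to test the minimum statistic.

```java
import java.time.LocalDate;
import java.util.OptionalInt;

public class DateCorruptionCheck {
    /** What the file-level created-by metadata tells us (hypothetical classification). */
    public enum Provenance { DRILL_AFTER_FIX, DRILL_BEFORE_FIX, UNKNOWN }

    /** Outcome of the check made when the reader is initialized. */
    public enum Decision { NO_CORRECTION, CORRECT, CHECK_DATA_PAGES }

    // Assumed cutoff, not taken from the comment: days since the Unix epoch
    // for 10000-01-01. Values at or beyond it are treated as shifted dates.
    public static final int SUSPICIOUS_DAYS =
        (int) LocalDate.of(10000, 1, 1).toEpochDay();

    /**
     * @param provenance         result of inspecting the created-by metadata
     * @param autoCorrectEnabled the user-facing option; only consulted when
     *                           the provenance is UNKNOWN
     * @param minDateStat        file-level minimum of the date column, empty
     *                           when the file carries no min/max statistics
     */
    public static Decision decide(Provenance provenance,
                                  boolean autoCorrectEnabled,
                                  OptionalInt minDateStat) {
        switch (provenance) {
            case DRILL_AFTER_FIX:
                return Decision.NO_CORRECTION;   // option is ignored
            case DRILL_BEFORE_FIX:
                return Decision.CORRECT;         // option is ignored
            default:
                if (!autoCorrectEnabled) {
                    return Decision.NO_CORRECTION;
                }
                if (minDateStat.isPresent()) {
                    // Assumption: if even the smallest date is implausibly
                    // far in the future, the whole column was shifted.
                    return minDateStat.getAsInt() >= SUSPICIOUS_DAYS
                        ? Decision.CORRECT
                        : Decision.NO_CORRECTION;
                }
                // No statistics: detection has to wait for the data pages.
                return Decision.CHECK_DATA_PAGES;
        }
    }

    public static void main(String[] args) {
        System.out.println(decide(Provenance.UNKNOWN, true, OptionalInt.of(4881176)));
        System.out.println(decide(Provenance.UNKNOWN, true, OptionalInt.empty()));
        System.out.println(decide(Provenance.UNKNOWN, false, OptionalInt.of(4881176)));
    }
}
```

The option only matters in the UNKNOWN branch, which matches the description 
above: files with a recognizable Drill version string are handled the same 
way whatever the option says.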

The bug shifted the dates significantly, putting them thousands of years into 
the future. Auto-correction as the default is therefore low risk, as it is 
extremely unlikely that users will have created a database full of dates in 
this range. That said, the option is included to cover any such cases.
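
To make the size of the shift concrete, here is a small worked example using 
the value from the original report quoted below, where 1970-01-01 was stored 
as 4881176. Read as days since the Unix epoch, that lands around the year 
15334. The value also equals twice 2440588, the Julian day number of 
1970-01-01. Treating that as the exact shift to subtract is an assumption 
drawn from this single sample, not something stated in this comment, and the 
class and method names are hypothetical.

```java
import java.time.LocalDate;
import java.time.temporal.JulianFields;

public class DateShiftExample {
    // Raw INT32 value that parquet-tools printed for 1970-01-01 in the report.
    public static final long STORED_VALUE = 4_881_176L;

    /** Julian day number of the Unix epoch, computed with java.time. */
    public static long julianDayOfEpoch() {
        return LocalDate.ofEpochDay(0).getLong(JulianFields.JULIAN_DAY);
    }

    /** Assumed correction: subtract twice the Julian day number of the epoch. */
    public static long corrected(long stored) {
        return stored - 2 * julianDayOfEpoch();
    }

    public static void main(String[] args) {
        // The stored value interpreted literally as days since 1970-01-01.
        System.out.println(LocalDate.ofEpochDay(STORED_VALUE));
        // The Julian day number of the epoch.
        System.out.println(julianDayOfEpoch());
        // The same value after the assumed correction.
        System.out.println(LocalDate.ofEpochDay(corrected(STORED_VALUE)));
    }
}
```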


was (Author: jaltekruse):
[~zfong] That is correct. The only extra complexity is that I have added an 
option that allows users to optionally turn-off auto-correction for any files 
that are not certain to have been created by Drill.

The default behavior will be to check the file level created-by metadata, if we 
know it is a version of Drill after the fix, not correction will happen 
regardless of the setting of the option. Similarly for a file with a drill 
version string, that indicates the data was written before this fix, we will 
always correct the data, regardless of the setting of this flag.

The only complicated case is where there is not enough metadata to determine if 
it is a Drill file or not. In this case we will check the values in the file, 
either in the file level min/max statistics when the reader is initialized or 
when the file lacks min/max value statistics (it's a pre-1.0 drill file) we 
will have to defer detection until actually reading individual data pages. 
Checks at both of these levels can be disabled by the option.

The nature of the bug caused a really significant shift of the dates, putting 
them thousands of years into the future. Thus auto-correction as the default 
isn't high risk as it extremely unlikely users will have created a database 
full of dates in this range. That being said, the option is included to cover 
any such cases.

> Parquet File : Date is stored wrongly
> -------------------------------------
>
>                 Key: DRILL-4203
>                 URL: https://issues.apache.org/jira/browse/DRILL-4203
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.4.0
>            Reporter: Stéphane Trou
>            Assignee: Jason Altekruse
>            Priority: Critical
>
> Hello,
> I have some problems when I try to read Parquet files produced by Drill 
> with Spark: all dates are corrupted.
> I think the problem comes from Drill :)
> {code}
> cat /tmp/date_parquet.csv 
> Epoch,1970-01-01
> {code}
> {code}
> 0: jdbc:drill:zk=local> select columns[0] as name, cast(columns[1] as date) as epoch_date from dfs.tmp.`date_parquet.csv`;
> +--------+-------------+
> |  name  | epoch_date  |
> +--------+-------------+
> | Epoch  | 1970-01-01  |
> +--------+-------------+
> {code}
> {code}
> 0: jdbc:drill:zk=local> create table dfs.tmp.`buggy_parquet` as select columns[0] as name, cast(columns[1] as date) as epoch_date from dfs.tmp.`date_parquet.csv`;
> +-----------+----------------------------+
> | Fragment  | Number of records written  |
> +-----------+----------------------------+
> | 0_0       | 1                          |
> +-----------+----------------------------+
> {code}
> When I read the file with parquet-tools, I found:
> {code}
> java -jar parquet-tools-1.8.1.jar head /tmp/buggy_parquet/
> name = Epoch
> epoch_date = 4881176
> {code}
> According to 
> [https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#date], 
> epoch_date should be equal to 0.
> Meta : 
> {code}
> java -jar parquet-tools-1.8.1.jar meta /tmp/buggy_parquet/
> file:        file:/tmp/buggy_parquet/0_0_0.parquet 
> creator:     parquet-mr version 1.8.1-drill-r0 (build 6b605a4ea05b66e1a6bf843353abcb4834a4ced8) 
> extra:       drill.version = 1.4.0 
> file schema: root 
> --------------------------------------------------------------------------------
> name:        OPTIONAL BINARY O:UTF8 R:0 D:1
> epoch_date:  OPTIONAL INT32 O:DATE R:0 D:1
> row group 1: RC:1 TS:93 OFFSET:4 
> --------------------------------------------------------------------------------
> name:         BINARY SNAPPY DO:0 FPO:4 SZ:52/50/0,96 VC:1 ENC:RLE,BIT_PACKED,PLAIN
> epoch_date:   INT32 SNAPPY DO:0 FPO:56 SZ:45/43/0,96 VC:1 ENC:RLE,BIT_PACKED,PLAIN
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
