[
https://issues.apache.org/jira/browse/DRILL-4203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15116281#comment-15116281
]
Jason Altekruse edited comment on DRILL-4203 at 1/26/16 12:28 AM:
------------------------------------------------------------------
[~zfong] That is correct. The only extra complexity is that I have added an
option that lets users turn off auto-correction for any files that are not
certain to have been created by Drill.
The default behavior will be to check the file-level created-by metadata. If we
know the file was written by a version of Drill after the fix, no correction
will happen, regardless of the setting of the option. Similarly, for a file
whose Drill version string indicates the data was written before this fix, we
will always correct the data, regardless of the setting of this option.
The only complicated case is where there is not enough metadata to determine
whether it is a Drill file. In this case we will check the values in the file:
either in the file-level min/max statistics when the reader is initialized or,
when the file lacks min/max statistics (it's a pre-1.0 Drill file), by
deferring detection until we actually read individual data pages. Checks at
both of these levels can be disabled by the option.
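The decision procedure described above can be sketched roughly as follows. This
is an illustration only, not the actual Drill code: the function name, the
writer labels, the Action enum and the SUSPICIOUS_DAYS cutoff are all
hypothetical names and values chosen for the example.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    CORRECT = "correct dates"
    LEAVE = "leave dates as-is"
    CHECK_PAGES = "decide per data page"


# Hypothetical cutoff in days since 1970-01-01 (about the year 4700).
# The real threshold used by Drill is not stated in this comment.
SUSPICIOUS_DAYS = 1_000_000


def decide(writer: Optional[str], autocorrect: bool,
           max_date_stat: Optional[int]) -> Action:
    """Pick an action for one Parquet file.

    writer        -- "drill_post_fix", "drill_pre_fix", or None when the
                     metadata does not identify the file (assumed labels)
    autocorrect   -- the user-facing option that enables value-based checks
    max_date_stat -- file-level max statistic for the date column, if any
    """
    # Metadata is conclusive: the option is ignored in both cases.
    if writer == "drill_post_fix":
        return Action.LEAVE
    if writer == "drill_pre_fix":
        return Action.CORRECT
    # Metadata is inconclusive: value-based checks run only if enabled.
    if not autocorrect:
        return Action.LEAVE
    # No statistics available: detection is deferred to the data pages.
    if max_date_stat is None:
        return Action.CHECK_PAGES
    # Statistics available: decide once, when the reader is initialized.
    return Action.CORRECT if max_date_stat > SUSPICIOUS_DAYS else Action.LEAVE
```

For example, decide(None, True, None) returns Action.CHECK_PAGES, which
corresponds to the pre-1.0 case where no min/max statistics exist.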
The bug shifted the dates significantly, putting them thousands of years into
the future. Thus auto-correction as the default isn't high risk, as it is
extremely unlikely that users will have created a database full of dates in
this range. That being said, the option is included to cover any such cases.
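To put a number on the size of the shift, here is a small worked example using
the value from the issue description quoted below, where 1970-01-01 was stored
as 4881176 instead of 0. That value equals twice the Julian day number of the
Unix epoch (2440588). Reading this as the cause of the bug is an inference from
the arithmetic, not something stated in this comment, and correct_date is a
hypothetical helper, not a Drill function.

```python
# Julian day number of 1970-01-01 (the Unix epoch).
JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588

# Value parquet-tools printed for 1970-01-01 in the issue description;
# the correct stored value would have been 0, so this is the whole shift.
OBSERVED_SHIFT = 4_881_176

# The shift is exactly twice the Julian day number of the epoch.
assert OBSERVED_SHIFT == 2 * JULIAN_DAY_OF_UNIX_EPOCH

# Rough size of the shift in years, using the mean Gregorian year length.
# datetime.date is not used because it cannot represent years past 9999.
approx_years = OBSERVED_SHIFT / 365.2425


def correct_date(raw_days: int) -> int:
    """Undo the shift on a raw days-since-epoch value (assumed fix)."""
    return raw_days - OBSERVED_SHIFT


print(f"shift is about {approx_years:.0f} years")
print(f"corrected epoch value: {correct_date(4_881_176)}")
```

The shift comes out at well over 13,000 years, which is why dates in the
affected range are so unlikely to be legitimate user data.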
> Parquet File : Date is stored wrongly
> -------------------------------------
>
> Key: DRILL-4203
> URL: https://issues.apache.org/jira/browse/DRILL-4203
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.4.0
> Reporter: Stéphane Trou
> Assignee: Jason Altekruse
> Priority: Critical
>
> Hello,
> I have some problems when I try to read Parquet files produced by Drill with
> Spark: all dates are corrupted.
> I think the problem comes from Drill :)
> {code}
> cat /tmp/date_parquet.csv
> Epoch,1970-01-01
> {code}
> {code}
> 0: jdbc:drill:zk=local> select columns[0] as name, cast(columns[1] as date)
> as epoch_date from dfs.tmp.`date_parquet.csv`;
> +--------+-------------+
> | name | epoch_date |
> +--------+-------------+
> | Epoch | 1970-01-01 |
> +--------+-------------+
> {code}
> {code}
> 0: jdbc:drill:zk=local> create table dfs.tmp.`buggy_parquet`as select
> columns[0] as name, cast(columns[1] as date) as epoch_date from
> dfs.tmp.`date_parquet.csv`;
> +-----------+----------------------------+
> | Fragment | Number of records written |
> +-----------+----------------------------+
> | 0_0 | 1 |
> +-----------+----------------------------+
> {code}
> When I read the file with parquet-tools, I found:
> {code}
> java -jar parquet-tools-1.8.1.jar head /tmp/buggy_parquet/
> name = Epoch
> epoch_date = 4881176
> {code}
> According to
> [https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#date],
> epoch_date should be equal to 0.
> Meta :
> {code}
> java -jar parquet-tools-1.8.1.jar meta /tmp/buggy_parquet/
> file: file:/tmp/buggy_parquet/0_0_0.parquet
> creator: parquet-mr version 1.8.1-drill-r0 (build
> 6b605a4ea05b66e1a6bf843353abcb4834a4ced8)
> extra: drill.version = 1.4.0
> file schema: root
> --------------------------------------------------------------------------------
> name: OPTIONAL BINARY O:UTF8 R:0 D:1
> epoch_date: OPTIONAL INT32 O:DATE R:0 D:1
> row group 1: RC:1 TS:93 OFFSET:4
> --------------------------------------------------------------------------------
> name: BINARY SNAPPY DO:0 FPO:4 SZ:52/50/0,96 VC:1
> ENC:RLE,BIT_PACKED,PLAIN
> epoch_date: INT32 SNAPPY DO:0 FPO:56 SZ:45/43/0,96 VC:1
> ENC:RLE,BIT_PACKED,PLAIN
> {code}