[jira] [Commented] (DRILL-4203) Parquet File : Date is stored wrongly

Jason Altekruse (JIRA) Wed, 30 Dec 2015 22:22:37 -0800

    [ 
https://issues.apache.org/jira/browse/DRILL-4203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075769#comment-15075769
 ]


Jason Altekruse commented on DRILL-4203:
----------------------------------------

Jacques, there was some confusion when we were trying to read the spec, but I 
seem to recall that we eventually determined that the term "Julian Day" was 
just being used to refer to a count of days. Oddly, I cannot find a reference to 
Julian day anywhere either, though I could have sworn it was there somewhere. I 
am pretty sure I knew that this count was defined relative to the 
Unix epoch, not the astronomical Julian day.

Very unfortunately it looks like we have been writing dates incorrectly since 
the feature was added, and interpreting the incorrect values in our reader.
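
To make the discrepancy concrete, here is a minimal Python sketch (not Drill 
code) comparing the two day counts for the date in the report. The helper 
names and the ORDINAL_TO_JDN constant are my own, for illustration only. The 
arithmetic shows that the value parquet-tools printed, 4881176, equals twice 
the Julian Day Number of 1970-01-01; that is an observation about this one 
sample, not a statement of exactly where the writer goes wrong.

```python
from datetime import date

UNIX_EPOCH = date(1970, 1, 1)

# Offset between Python's proleptic Gregorian ordinal (0001-01-01 == 1)
# and the Julian Day Number of the same calendar day.
ORDINAL_TO_JDN = 1721425


def parquet_date_value(d: date) -> int:
    """Days since the Unix epoch, as the Parquet DATE annotation expects."""
    return (d - UNIX_EPOCH).days


def julian_day_number(d: date) -> int:
    """Astronomical Julian Day Number for a calendar date."""
    return d.toordinal() + ORDINAL_TO_JDN


# Value reported by parquet-tools in the issue for 1970-01-01.
observed = 4881176

print(parquet_date_value(UNIX_EPOCH))                 # 0
print(julian_day_number(UNIX_EPOCH))                  # 2440588
print(observed == 2 * julian_day_number(UNIX_EPOCH))  # True
```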

The fix is small, but we are now going to have to migrate all files written 
with Drill previously to allow them to be read correctly with the fixed reader. 
I'll start a thread on this as we will need to have an effective strategy for 
migrating all existing files with dates. The recently created migration tool 
only rewrote metadata to fix a performance regression related to statistics, 
not a correctness issue. This will have a more significant impact on users.

I wrote this code originally and I blame myself for not testing it more 
thoroughly. I will also open a discussion on how we can prevent issues like 
this in the future. Because storage systems in Drill are pluggable, 
it is important that we both thoroughly test the default storage formats we 
write against other tools and provide the appropriate tools for 
plugin developers to completely test their own code.

> Parquet File : Date is stored wrongly
> -------------------------------------
>
>                 Key: DRILL-4203
>                 URL: https://issues.apache.org/jira/browse/DRILL-4203
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.4.0
>            Reporter: Stéphane Trou
>            Assignee: Jason Altekruse
>
> Hello,
> I have some problems when I try to read Parquet files produced by Drill with 
> Spark: all dates are corrupted.
> I think the problem comes from Drill :)
> {code}
> cat /tmp/date_parquet.csv 
> Epoch,1970-01-01
> {code}
> {code}
> 0: jdbc:drill:zk=local> select columns[0] as name, cast(columns[1] as date) 
> as epoch_date from dfs.tmp.`date_parquet.csv`;
> +--------+-------------+
> |  name  | epoch_date  |
> +--------+-------------+
> | Epoch  | 1970-01-01  |
> +--------+-------------+
> {code}
> {code}
> 0: jdbc:drill:zk=local> create table dfs.tmp.`buggy_parquet`as select 
> columns[0] as name, cast(columns[1] as date) as epoch_date from 
> dfs.tmp.`date_parquet.csv`;
> +-----------+----------------------------+
> | Fragment  | Number of records written  |
> +-----------+----------------------------+
> | 0_0       | 1                          |
> +-----------+----------------------------+
> {code}
> When I read the file with parquet tools, I found:
> {code}
> java -jar parquet-tools-1.8.1.jar head /tmp/buggy_parquet/
> name = Epoch
> epoch_date = 4881176
> {code}
> According to 
> [https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#date], 
> epoch_date should be equal to 0.
> Meta : 
> {code}
> java -jar parquet-tools-1.8.1.jar meta /tmp/buggy_parquet/
> file:        file:/tmp/buggy_parquet/0_0_0.parquet 
> creator:     parquet-mr version 1.8.1-drill-r0 (build 
> 6b605a4ea05b66e1a6bf843353abcb4834a4ced8) 
> extra:       drill.version = 1.4.0 
> file schema: root 
> --------------------------------------------------------------------------------
> name:        OPTIONAL BINARY O:UTF8 R:0 D:1
> epoch_date:  OPTIONAL INT32 O:DATE R:0 D:1
> row group 1: RC:1 TS:93 OFFSET:4 
> --------------------------------------------------------------------------------
> name:         BINARY SNAPPY DO:0 FPO:4 SZ:52/50/0,96 VC:1 
> ENC:RLE,BIT_PACKED,PLAIN
> epoch_date:   INT32 SNAPPY DO:0 FPO:56 SZ:45/43/0,96 VC:1 
> ENC:RLE,BIT_PACKED,PLAIN
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
