[
https://issues.apache.org/jira/browse/DRILL-4203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075769#comment-15075769
]
Jason Altekruse commented on DRILL-4203:
----------------------------------------
Jacques, there was some confusion when we were trying to read the spec, but I
seem to recall that we eventually determined that the term "Julian Day" was
just being used to refer to a count of days. Oddly, I cannot find a reference to
Julian day anywhere either, but I could have sworn it was there somewhere. I
am pretty sure I knew that this count was defined relative to the
Unix epoch, not the astronomical Julian day.
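
To make the distinction concrete, here is a minimal Python sketch of the two day
counts. It is an illustration only, not Drill code; the helper names
`days_since_unix_epoch` and `julian_day_number` are hypothetical, and the
constant 1721425 is the standard offset between Python's proleptic Gregorian
ordinal and the (noon-based, integer) Julian Day Number.

```python
from datetime import date

UNIX_EPOCH = date(1970, 1, 1)

# Offset from date.toordinal() (0001-01-01 == 1) to the Julian Day Number.
ORDINAL_TO_JDN = 1721425


def days_since_unix_epoch(d: date) -> int:
    """Day count relative to 1970-01-01, as the Parquet DATE type expects."""
    return (d - UNIX_EPOCH).days


def julian_day_number(d: date) -> int:
    """Astronomical Julian Day Number for the given calendar date."""
    return d.toordinal() + ORDINAL_TO_JDN


if __name__ == "__main__":
    print(days_since_unix_epoch(UNIX_EPOCH))  # day count for the epoch itself
    print(julian_day_number(UNIX_EPOCH))      # Julian Day Number of the epoch
```

For 1970-01-01 the first count is 0, while the Julian Day Number is 2440588, so
the two conventions differ by a large constant.
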
Very unfortunately it looks like we have been writing dates incorrectly since
the feature was added, and interpreting the incorrect values in our reader.
The fix is small, but we are now going to have to migrate all files written
with Drill previously to allow them to be read correctly with the fixed reader.
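
As a rough sketch of what correcting a stored value could look like, the Python
below subtracts the shift seen in the report, where 1970-01-01 was written as
4881176. That number is exactly 2 * 2440588, twice the Julian Day Number of the
Unix epoch, which would be consistent with the offset being applied twice,
though nothing in this thread confirms the cause. The sketch assumes the shift
is the same constant for every date, which is inferred from that single data
point, and it makes no attempt to detect files that were already written
correctly. The names `OBSERVED_SHIFT`, `corrected_days` and `to_date` are
hypothetical, not part of Drill.

```python
from datetime import date, timedelta

UNIX_EPOCH = date(1970, 1, 1)
JDN_UNIX_EPOCH = 2440588   # Julian Day Number of 1970-01-01

# Raw INT32 value found in the file for 1970-01-01 (from the bug report).
OBSERVED_SHIFT = 4881176


def corrected_days(raw_value: int) -> int:
    """Assumes a constant shift: map a raw stored value to days since the epoch."""
    return raw_value - OBSERVED_SHIFT


def to_date(raw_value: int) -> date:
    """Convert a raw stored value to a calendar date under the same assumption."""
    return UNIX_EPOCH + timedelta(days=corrected_days(raw_value))


if __name__ == "__main__":
    print(OBSERVED_SHIFT == 2 * JDN_UNIX_EPOCH)
    print(to_date(4881176))
```

A real migration would also need a reliable way to tell affected files from
unaffected ones before rewriting anything.
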
I'll start a thread on this as we will need to have an effective strategy for
migrating all existing files with dates. The recently created migration tool
only rewrote metadata to fix a performance regression related to statistics,
not a correctness issue. This will have a more significant impact on users.
I wrote this code originally and I blame myself for not testing it more
thoroughly. I will also open a discussion on how we can prevent issues like
this in the future. As we allow storage systems in Drill to be pluggable,
it is important that we both thoroughly test the default storage formats we
write against other tools and provide the appropriate tools for
plugin developers to completely test their own code.
> Parquet File : Date is stored wrongly
> -------------------------------------
>
> Key: DRILL-4203
> URL: https://issues.apache.org/jira/browse/DRILL-4203
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.4.0
> Reporter: Stéphane Trou
> Assignee: Jason Altekruse
>
> Hello,
> I have some problems when I try to read Parquet files produced by Drill with
> Spark: all dates are corrupted.
> I think the problem comes from Drill :)
> {code}
> cat /tmp/date_parquet.csv
> Epoch,1970-01-01
> {code}
> {code}
> 0: jdbc:drill:zk=local> select columns[0] as name, cast(columns[1] as date)
> as epoch_date from dfs.tmp.`date_parquet.csv`;
> +--------+-------------+
> | name | epoch_date |
> +--------+-------------+
> | Epoch | 1970-01-01 |
> +--------+-------------+
> {code}
> {code}
> 0: jdbc:drill:zk=local> create table dfs.tmp.`buggy_parquet`as select
> columns[0] as name, cast(columns[1] as date) as epoch_date from
> dfs.tmp.`date_parquet.csv`;
> +-----------+----------------------------+
> | Fragment | Number of records written |
> +-----------+----------------------------+
> | 0_0 | 1 |
> +-----------+----------------------------+
> {code}
> When I read the file with parquet-tools, I found:
> {code}
> java -jar parquet-tools-1.8.1.jar head /tmp/buggy_parquet/
> name = Epoch
> epoch_date = 4881176
> {code}
> According to
> [https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#date],
> epoch_date should be equal to 0.
> Meta :
> {code}
> java -jar parquet-tools-1.8.1.jar meta /tmp/buggy_parquet/
> file: file:/tmp/buggy_parquet/0_0_0.parquet
> creator: parquet-mr version 1.8.1-drill-r0 (build
> 6b605a4ea05b66e1a6bf843353abcb4834a4ced8)
> extra: drill.version = 1.4.0
> file schema: root
> --------------------------------------------------------------------------------
> name: OPTIONAL BINARY O:UTF8 R:0 D:1
> epoch_date: OPTIONAL INT32 O:DATE R:0 D:1
> row group 1: RC:1 TS:93 OFFSET:4
> --------------------------------------------------------------------------------
> name: BINARY SNAPPY DO:0 FPO:4 SZ:52/50/0,96 VC:1
> ENC:RLE,BIT_PACKED,PLAIN
> epoch_date: INT32 SNAPPY DO:0 FPO:56 SZ:45/43/0,96 VC:1
> ENC:RLE,BIT_PACKED,PLAIN
> {code}