[
https://issues.apache.org/jira/browse/DRILL-4203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083365#comment-15083365
]
Jason Altekruse commented on DRILL-4203:
----------------------------------------
That is correct: Drill currently reads its own (non-standard) dates fine.
I will have a patch up shortly that both fixes Drill to write and read the
correct format and works around the issue with the old files by looking at the
version number in the Parquet metadata.
Unfortunately, as we saw with the previous issue a few months ago, Drill has not
always written a unique version string into the Parquet files it produces.
Automatic correction of these files might therefore still require the same
metadata migration as before to tag the files as having been created by
Drill.
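As a rough illustration of the workaround described above (not the actual
patch), the sketch below decides from the file metadata whether dates need
correcting and then shifts them back. It relies on two assumptions. First, the
raw value 4881176 that the report quoted below shows for 1970-01-01 is exactly
2 * 2440588, twice the Julian day number of the Unix epoch, so the sketch
treats that as a constant shift. Second, the `drill.version` key seen in the
quoted `meta` output is taken as the marker for files written by Drill. The
function names and the `first_fixed_version` parameter are hypothetical.

```python
from datetime import date, timedelta

# Julian day number of 1970-01-01.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588
# Assumed shift: the reported raw value for the epoch is twice the number above.
CORRUPT_DATE_SHIFT = 2 * JULIAN_DAY_OF_UNIX_EPOCH


def is_corrupt(metadata, first_fixed_version):
    """Return True/False if the metadata settles the question, else None.

    `metadata` is the file's key/value metadata as a dict.
    `first_fixed_version` is a hypothetical tuple such as (major, minor, patch)
    naming the first Drill release that writes correct dates.
    """
    version = metadata.get("drill.version")
    if version is None:
        # No Drill version tag: the metadata alone cannot identify the writer.
        return None
    # Drop any suffix such as "-SNAPSHOT" before comparing numerically.
    parsed = tuple(int(part) for part in version.split("-")[0].split("."))
    return parsed < first_fixed_version


def correct_date(raw_days, file_is_corrupt):
    """Convert a Parquet DATE value (days since 1970-01-01) to a date."""
    days = raw_days - CORRUPT_DATE_SHIFT if file_is_corrupt else raw_days
    return date(1970, 1, 1) + timedelta(days=days)
```

Files for which `is_corrupt` returns None are the ones that would still need
the metadata migration mentioned above before they could be corrected
automatically.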
I am going to open a discussion on the Parquet list about this as soon as I
have a patch up, to see whether they would be willing to include the
workaround for reading the corrupt values in the parquet-mr library. That would
avoid the need to rewrite all of the data for it to be read correctly by
other tools. A complete rewrite may be more desirable in some cases, so that
the files are consistent, but we are trying to mitigate the need for such
expensive operations.
> Parquet File : Date is stored wrongly
> -------------------------------------
>
> Key: DRILL-4203
> URL: https://issues.apache.org/jira/browse/DRILL-4203
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.4.0
> Reporter: Stéphane Trou
> Assignee: Jason Altekruse
> Priority: Critical
>
> Hello,
> I have a problem when I try to read Parquet files produced by Drill with
> Spark: all dates are corrupted.
> I think the problem comes from Drill :)
> {code}
> cat /tmp/date_parquet.csv
> Epoch,1970-01-01
> {code}
> {code}
> 0: jdbc:drill:zk=local> select columns[0] as name, cast(columns[1] as date)
> as epoch_date from dfs.tmp.`date_parquet.csv`;
> +--------+-------------+
> | name | epoch_date |
> +--------+-------------+
> | Epoch | 1970-01-01 |
> +--------+-------------+
> {code}
> {code}
> 0: jdbc:drill:zk=local> create table dfs.tmp.`buggy_parquet` as select
> columns[0] as name, cast(columns[1] as date) as epoch_date from
> dfs.tmp.`date_parquet.csv`;
> +-----------+----------------------------+
> | Fragment | Number of records written |
> +-----------+----------------------------+
> | 0_0 | 1 |
> +-----------+----------------------------+
> {code}
> When I read the file with parquet-tools, I found:
> {code}
> java -jar parquet-tools-1.8.1.jar head /tmp/buggy_parquet/
> name = Epoch
> epoch_date = 4881176
> {code}
> According to
> [https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#date],
> epoch_date should be equal to 0.
> Metadata:
> {code}
> java -jar parquet-tools-1.8.1.jar meta /tmp/buggy_parquet/
> file: file:/tmp/buggy_parquet/0_0_0.parquet
> creator: parquet-mr version 1.8.1-drill-r0 (build
> 6b605a4ea05b66e1a6bf843353abcb4834a4ced8)
> extra: drill.version = 1.4.0
> file schema: root
> --------------------------------------------------------------------------------
> name: OPTIONAL BINARY O:UTF8 R:0 D:1
> epoch_date: OPTIONAL INT32 O:DATE R:0 D:1
> row group 1: RC:1 TS:93 OFFSET:4
> --------------------------------------------------------------------------------
> name: BINARY SNAPPY DO:0 FPO:4 SZ:52/50/0,96 VC:1
> ENC:RLE,BIT_PACKED,PLAIN
> epoch_date: INT32 SNAPPY DO:0 FPO:56 SZ:45/43/0,96 VC:1
> ENC:RLE,BIT_PACKED,PLAIN
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)