[
https://issues.apache.org/jira/browse/DRILL-4203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15113524#comment-15113524
]
Jason Altekruse commented on DRILL-4203:
----------------------------------------
Fixing this ended up taking longer than I anticipated, but it looks like the
fix can be entirely transparent for Drill users. Initially, external tools
will only be able to read the affected files correctly after the data has
been rewritten, but as I said previously, we are hoping to get the
auto-correction into the parquet-mr library as well.
Here is a branch with my work in progress:
https://github.com/jaltekruse/incubator-drill/tree/4203-parquet-dates-bug
The last issue I am working to solve is correcting the dates when the file
metadata does not carry enough information to determine whether corruption is
definitely present or absent. It looks like this applies to files created
before Drill 1.0, and there is a possibility that these corrupt files overlap
with non-corrupt files generated by other tools. In this case we will need to
modify the actual read of the data pages to test individual values for
corruption. I don't anticipate this taking much longer; the refactoring is
already in progress.
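
To illustrate what testing individual values could mean, here is a minimal
sketch in Python. It is not the code from the branch above. It assumes that
every corrupted value carries the same constant offset seen in the report
below (1970-01-01 stored as 4881176 instead of 0), and it uses a hypothetical
cutoff (day counts beyond the year 5000) to decide whether a value is corrupt.
The names OBSERVED_SHIFT, CORRUPTION_THRESHOLD, correct_date_value and to_date
are made up for this example.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

# Offset observed in this report: 1970-01-01 was written as 4881176, not 0.
# Assumption: every corrupted value is shifted by this same constant.
OBSERVED_SHIFT = 4881176

# Hypothetical cutoff: a stored day count later than 5000-01-01 is treated
# as corrupt. The real check may use a different rule.
CORRUPTION_THRESHOLD = (date(5000, 1, 1) - EPOCH).days


def correct_date_value(stored_days: int) -> int:
    """Return the day count since the Unix epoch, undoing the shift if the
    stored value looks corrupt under the assumptions above."""
    if stored_days > CORRUPTION_THRESHOLD:
        return stored_days - OBSERVED_SHIFT
    return stored_days


def to_date(stored_days: int) -> date:
    """Convert a stored Parquet DATE value to a calendar date."""
    return EPOCH + timedelta(days=correct_date_value(stored_days))


if __name__ == "__main__":
    print(to_date(4881176))  # corrupted value from the report
    print(to_date(0))        # correct value written by another tool
```

With these assumptions, both the corrupted value 4881176 and the correct value
0 decode to 1970-01-01, so files from Drill and from other tools can be read
by the same code path.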
> Parquet File : Date is stored wrongly
> -------------------------------------
>
> Key: DRILL-4203
> URL: https://issues.apache.org/jira/browse/DRILL-4203
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.4.0
> Reporter: Stéphane Trou
> Assignee: Jason Altekruse
> Priority: Critical
>
> Hello,
> I have some problems when I try to read Parquet files produced by Drill with
> Spark: all dates are corrupted.
> I think the problem comes from Drill :)
> {code}
> cat /tmp/date_parquet.csv
> Epoch,1970-01-01
> {code}
> {code}
> 0: jdbc:drill:zk=local> select columns[0] as name, cast(columns[1] as date)
> as epoch_date from dfs.tmp.`date_parquet.csv`;
> +--------+-------------+
> | name | epoch_date |
> +--------+-------------+
> | Epoch | 1970-01-01 |
> +--------+-------------+
> {code}
> {code}
> 0: jdbc:drill:zk=local> create table dfs.tmp.`buggy_parquet` as select
> columns[0] as name, cast(columns[1] as date) as epoch_date from
> dfs.tmp.`date_parquet.csv`;
> +-----------+----------------------------+
> | Fragment | Number of records written |
> +-----------+----------------------------+
> | 0_0 | 1 |
> +-----------+----------------------------+
> {code}
> When I read the file with parquet-tools, I found:
> {code}
> java -jar parquet-tools-1.8.1.jar head /tmp/buggy_parquet/
> name = Epoch
> epoch_date = 4881176
> {code}
> According to
> [https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#date],
> epoch_date should be equal to 0, since DATE is stored as the number of days
> since the Unix epoch.
> Metadata:
> {code}
> java -jar parquet-tools-1.8.1.jar meta /tmp/buggy_parquet/
> file: file:/tmp/buggy_parquet/0_0_0.parquet
> creator: parquet-mr version 1.8.1-drill-r0 (build
> 6b605a4ea05b66e1a6bf843353abcb4834a4ced8)
> extra: drill.version = 1.4.0
> file schema: root
> --------------------------------------------------------------------------------
> name: OPTIONAL BINARY O:UTF8 R:0 D:1
> epoch_date: OPTIONAL INT32 O:DATE R:0 D:1
> row group 1: RC:1 TS:93 OFFSET:4
> --------------------------------------------------------------------------------
> name: BINARY SNAPPY DO:0 FPO:4 SZ:52/50/0,96 VC:1
> ENC:RLE,BIT_PACKED,PLAIN
> epoch_date: INT32 SNAPPY DO:0 FPO:56 SZ:45/43/0,96 VC:1
> ENC:RLE,BIT_PACKED,PLAIN
> {code}