[
https://issues.apache.org/jira/browse/DRILL-4203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vitalii Diravka updated DRILL-4203:
-----------------------------------
Description:
Hello,
I have some problems when I try to read, with Spark, Parquet files produced
by Drill: all the dates come out corrupted.
I think the problem comes from Drill :)
{code}
cat /tmp/date_parquet.csv
Epoch,1970-01-01
{code}
{code}
0: jdbc:drill:zk=local> select columns[0] as name, cast(columns[1] as date) as
epoch_date from dfs.tmp.`date_parquet.csv`;
+--------+-------------+
| name | epoch_date |
+--------+-------------+
| Epoch | 1970-01-01 |
+--------+-------------+
{code}
{code}
0: jdbc:drill:zk=local> create table dfs.tmp.`buggy_parquet` as select
columns[0] as name, cast(columns[1] as date) as epoch_date from
dfs.tmp.`date_parquet.csv`;
+-----------+----------------------------+
| Fragment | Number of records written |
+-----------+----------------------------+
| 0_0 | 1 |
+-----------+----------------------------+
{code}
When I read the file with parquet-tools, I found:
{code}
java -jar parquet-tools-1.8.1.jar head /tmp/buggy_parquet/
name = Epoch
epoch_date = 4881176
{code}
According to
[https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#date],
a DATE is stored as the number of days since the Unix epoch (1970-01-01), so
epoch_date should be equal to 0.
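The magnitude of the bad value is suggestive: 4881176 is exactly 2 × 2440588,
and 2440588 is the Julian day number of 1970-01-01. A minimal arithmetic check
in Python (the Julian-day interpretation is an assumption on my part, not
something confirmed in this report):
{code}
# Assumption: the corrupted value is the correct value shifted by twice the
# Julian day number of the Unix epoch (2440588), i.e. the writer applied a
# Julian <-> Unix day conversion in the wrong direction.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588

corrupted = 4881176                              # value shown by parquet-tools
corrected = corrupted - 2 * JULIAN_DAY_OF_UNIX_EPOCH
print(corrected)                                 # 0 -> 1970-01-01, as expected
{code}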
Meta:
{code}
java -jar parquet-tools-1.8.1.jar meta /tmp/buggy_parquet/
file: file:/tmp/buggy_parquet/0_0_0.parquet
creator: parquet-mr version 1.8.1-drill-r0 (build 6b605a4ea05b66e1a6bf843353abcb4834a4ced8)
extra: drill.version = 1.4.0
file schema: root
--------------------------------------------------------------------------------
name: OPTIONAL BINARY O:UTF8 R:0 D:1
epoch_date: OPTIONAL INT32 O:DATE R:0 D:1
row group 1: RC:1 TS:93 OFFSET:4
--------------------------------------------------------------------------------
name: BINARY SNAPPY DO:0 FPO:4 SZ:52/50/0,96 VC:1 ENC:RLE,BIT_PACKED,PLAIN
epoch_date: INT32 SNAPPY DO:0 FPO:56 SZ:45/43/0,96 VC:1 ENC:RLE,BIT_PACKED,PLAIN
{code}
Implementation:
After the fix, Drill can automatically detect date corruption in Parquet files
and convert the corrupted values to correct ones.
For the case when the user wants to work with dates more than 5,000 years in
the future, an option is included to turn off the auto-correction.
Use of this option is assumed to be extremely unlikely, but it is included for
completeness.
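A minimal sketch of how such auto-correction can work, assuming corrupted
values are the correct ones shifted by 2 × 2440588 days and that anything more
than roughly 5,000 years in the future is treated as corrupted (the constants
and function below are illustrative, not Drill's actual implementation):
{code}
from datetime import date, timedelta

# Illustrative constants; Drill's real values may differ.
CORRUPTION_SHIFT = 2 * 2440588          # suspected shift, in days
THRESHOLD_DAYS = 5000 * 365             # roughly 5,000 years in the future

def correct_days(days_since_epoch, auto_correct=True):
    # Undo the suspected shift only when auto-correction is enabled.
    if auto_correct and days_since_epoch > THRESHOLD_DAYS:
        return days_since_epoch - CORRUPTION_SHIFT
    return days_since_epoch

print(date(1970, 1, 1) + timedelta(days=correct_days(4881176)))  # 1970-01-01
print(correct_days(4881176, auto_correct=False))  # 4881176, kept as written
{code}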
To disable auto-correction, use the parquet config in the storage plugin
settings. Something like this:
{code}
"formats": {
"parquet": {
"type": "parquet",
"autoCorrectCorruptDates": false
}
{code}
Or you can set the option per query with a table function, like this:
{code}
select l_shipdate, l_commitdate from
table(dfs.`/drill/testdata/parquet_date/dates_nodrillversion/drillgen2_lineitem`
(type => 'parquet', autoCorrectCorruptDates => false)) limit 1;
{code}
After the fix, new files generated by Drill will have an
"is.date.correct = true" extra property in the Parquet
metadata, which indicates that the file cannot contain corrupted date values.
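A quick way to verify that flag on a newly written file, sketched with pyarrow
(the file path is hypothetical; parquet-tools meta would show the same property
under "extra"):
{code}
import pyarrow.parquet as pq

# Hypothetical path to a file written by a fixed Drill version.
meta = pq.read_metadata("/tmp/fixed_parquet/0_0_0.parquet")
kv = meta.metadata or {}                 # file-level key/value metadata (bytes)
print(kv.get(b"is.date.correct"))        # expected: b'true' on new files
{code}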
> Parquet File : Date is stored wrongly
> -------------------------------------
>
> Key: DRILL-4203
> URL: https://issues.apache.org/jira/browse/DRILL-4203
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.4.0
> Reporter: Stéphane Trou
> Assignee: Vitalii Diravka
> Priority: Critical
> Labels: doc-impacting
> Fix For: 1.9.0
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)