[
https://issues.apache.org/jira/browse/DRILL-4203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vitalii Diravka updated DRILL-4203:
-----------------------------------
Description:
Hello,
I have some problems when I try to read, with Spark, Parquet files produced
by Drill: all the dates come out corrupted.
I think the problem comes from Drill :)
{code}
cat /tmp/date_parquet.csv
Epoch,1970-01-01
{code}
{code}
0: jdbc:drill:zk=local> select columns[0] as name, cast(columns[1] as date) as
epoch_date from dfs.tmp.`date_parquet.csv`;
+--------+-------------+
| name | epoch_date |
+--------+-------------+
| Epoch | 1970-01-01 |
+--------+-------------+
{code}
{code}
0: jdbc:drill:zk=local> create table dfs.tmp.`buggy_parquet` as select
columns[0] as name, cast(columns[1] as date) as epoch_date from
dfs.tmp.`date_parquet.csv`;
+-----------+----------------------------+
| Fragment | Number of records written |
+-----------+----------------------------+
| 0_0 | 1 |
+-----------+----------------------------+
{code}
When I read the file with parquet-tools, I found:
{code}
java -jar parquet-tools-1.8.1.jar head /tmp/buggy_parquet/
name = Epoch
epoch_date = 4881176
{code}
According to
[https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#date],
a DATE is stored as the number of days since the Unix epoch (1970-01-01), so
epoch_date should be equal to 0.
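The magnitude of the bad value is suggestive: 4881176 is exactly 2 × 2440588,
and 2440588 is the Julian day number of 1970-01-01. A minimal arithmetic check
in Python (the Julian-day interpretation is an assumption on my part, not
something confirmed in this report):
{code}
# Assumption: the corrupted value is the correct value shifted by twice the
# Julian day number of the Unix epoch (2440588), i.e. the writer applied a
# Julian <-> Unix day conversion in the wrong direction.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588

corrupted = 4881176                              # value shown by parquet-tools
corrected = corrupted - 2 * JULIAN_DAY_OF_UNIX_EPOCH
print(corrected)                                 # 0 -> 1970-01-01, as expected
{code}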
Meta:
{code}
java -jar parquet-tools-1.8.1.jar meta /tmp/buggy_parquet/
file: file:/tmp/buggy_parquet/0_0_0.parquet
creator: parquet-mr version 1.8.1-drill-r0 (build 6b605a4ea05b66e1a6bf843353abcb4834a4ced8)
extra: drill.version = 1.4.0
file schema: root
--------------------------------------------------------------------------------
name: OPTIONAL BINARY O:UTF8 R:0 D:1
epoch_date: OPTIONAL INT32 O:DATE R:0 D:1
row group 1: RC:1 TS:93 OFFSET:4
--------------------------------------------------------------------------------
name: BINARY SNAPPY DO:0 FPO:4 SZ:52/50/0,96 VC:1 ENC:RLE,BIT_PACKED,PLAIN
epoch_date: INT32 SNAPPY DO:0 FPO:56 SZ:45/43/0,96 VC:1 ENC:RLE,BIT_PACKED,PLAIN
{code}
Implementation:
After the fix, Drill can automatically detect date corruption in Parquet files
and convert the corrupted values to correct ones.
For the case when the user wants to work with dates more than 5,000 years in
the future, an option is included to turn off the auto-correction.
Use of this option is assumed to be extremely unlikely, but it is included for
completeness.
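A minimal sketch of how such auto-correction can work, assuming corrupted
values are the correct ones shifted by 2 × 2440588 days and that anything more
than roughly 5,000 years in the future is treated as corrupted (the constants
and function below are illustrative, not Drill's actual implementation):
{code}
from datetime import date, timedelta

# Illustrative constants; Drill's real values may differ.
CORRUPTION_SHIFT = 2 * 2440588          # suspected shift, in days
THRESHOLD_DAYS = 5000 * 365             # roughly 5,000 years in the future

def correct_days(days_since_epoch, auto_correct=True):
    # Undo the suspected shift only when auto-correction is enabled.
    if auto_correct and days_since_epoch > THRESHOLD_DAYS:
        return days_since_epoch - CORRUPTION_SHIFT
    return days_since_epoch

print(date(1970, 1, 1) + timedelta(days=correct_days(4881176)))  # 1970-01-01
print(correct_days(4881176, auto_correct=False))  # 4881176, kept as written
{code}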
To disable auto-correction, use the parquet config in the storage plugin
settings. Something like this:
{code}
"formats": {
"parquet": {
"type": "parquet",
"autoCorrectCorruptDates": false
}
{code}
Or you can set the option per query with a table function, like this:
{code}
select l_shipdate, l_commitdate from
table(dfs.`/drill/testdata/parquet_date/dates_nodrillversion/drillgen2_lineitem`
(type => 'parquet', autoCorrectCorruptDates => false)) limit 1;
{code}
After the fix, new files generated by Drill will have an
"is.date.correct = true" extra property in the Parquet
metadata, which indicates that the file cannot contain corrupted date values.
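A quick way to verify that flag on a newly written file, sketched with pyarrow
(the file path is hypothetical; parquet-tools meta would show the same property
under "extra"):
{code}
import pyarrow.parquet as pq

# Hypothetical path to a file written by a fixed Drill version.
meta = pq.read_metadata("/tmp/fixed_parquet/0_0_0.parquet")
kv = meta.metadata or {}                 # file-level key/value metadata (bytes)
print(kv.get(b"is.date.correct"))        # expected: b'true' on new files
{code}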
> Parquet File : Date is stored wrongly
> -------------------------------------
>
> Key: DRILL-4203
> URL: https://issues.apache.org/jira/browse/DRILL-4203
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.4.0
> Reporter: Stéphane Trou
> Assignee: Vitalii Diravka
> Priority: Critical
> Labels: doc-impacting
> Fix For: 1.9.0
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)