Adam Gilmore created DRILL-2286:
-----------------------------------

             Summary: Parquet compression causes read errors
                 Key: DRILL-2286
                 URL: https://issues.apache.org/jira/browse/DRILL-2286
             Project: Apache Drill
          Issue Type: Bug
          Components: Storage - Parquet
    Affects Versions: 0.8.0
            Reporter: Adam Gilmore
            Assignee: Steven Phillips
            Priority: Critical


From what I can see, since compression was added to the Parquet writer, 
read errors can occur.

Types such as timestamp and decimal are stored as int64 with some 
metadata.  It appears that when the column is compressed, the reader tries to 
read the int64 values into a vector of the timestamp/decimal type, which 
causes a cast error.
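
To illustrate what "int64 with some metadata" means for a decimal(18,8) 
column, here is a minimal Python sketch. It assumes the usual convention of 
storing the unscaled integer and keeping the scale as column metadata; the 
function names are made up for illustration and are not part of Drill or 
Parquet.

```python
from decimal import Decimal


def to_unscaled_int64(value: Decimal, scale: int) -> int:
    """Shift the decimal point right by `scale` digits and keep the integer.

    Digits beyond `scale` are truncated by int(); the result must fit in a
    signed 64-bit integer to be storable as int64.
    """
    unscaled = int(value.scaleb(scale))
    if not -(2 ** 63) <= unscaled < 2 ** 63:
        raise OverflowError("value does not fit in int64 at this scale")
    return unscaled


def from_unscaled_int64(unscaled: int, scale: int) -> Decimal:
    """Reverse the mapping using the scale taken from the column metadata."""
    return Decimal(unscaled).scaleb(-scale)
```

With scale 8, the value 1.5 from the test file is stored as the int64 
150000000. A reader that ignores the metadata sees only a plain integer 
column, which matches the mismatch described above.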

Here's the JSON file I'm using:

{ "a": 1.5 }
{ "a": 3.5 }
{ "a": 1.5 }
{ "a": 2.5 }
{ "a": 1.5 }
{ "a": 5.5 }
{ "a": 1.5 }
{ "a": 6.0 }
{ "a": 1.5 }
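
For convenience, the file can be regenerated with a short Python sketch like 
the one below. The helper name is hypothetical, and the target path is an 
assumption: it has to be wherever your dfs.tmp workspace points.

```python
import json
from pathlib import Path

# The nine values from the report, in the same order.
VALUES = [1.5, 3.5, 1.5, 2.5, 1.5, 5.5, 1.5, 6.0, 1.5]


def write_test_json(path) -> int:
    """Write one JSON object per line and return the number of records."""
    lines = [json.dumps({"a": v}) for v in VALUES]
    Path(path).write_text("\n".join(lines) + "\n")
    return len(lines)
```

For example, write_test_json("/tmp/test.json") would write the file, 
assuming dfs.tmp maps to /tmp on your installation.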

Now create a Parquet table like so:

create table dfs.tmp.test as (select cast(a as decimal(18,8)) from 
dfs.tmp.`test.json`)

Now when you try to query it like so:

0: jdbc:drill:zk=local> select * from dfs.tmp.test;
Query failed: RemoteRpcException: Failure while running fragment., 
org.apache.drill.exec.vector.NullableDecimal18Vector cannot be cast to 
org.apache.drill.exec.vector.NullableBigIntVector [ 
91e23d42-fa06-4429-b78e-3ff32352e660 on ...:31010 ]
[ 91e23d42-fa06-4429-b78e-3ff32352e660 on ...:31010 ]

Error: exception while executing query: Failure while executing query. 
(state=,code=0)

The same failure occurs with timestamp columns.

The relevant code is in ColumnReaderFactory: if the column chunk is encoded, 
it creates specific readers based on the storage type of the column (in this 
case int64) rather than its timestamp/decimal type.
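
The suspected mismatch can be sketched in Python as follows. This is not 
Drill's actual ColumnReaderFactory code; it is a simplified model under the 
assumption that the encoded path dispatches on the storage type alone. Only 
the two vector class names are taken from the error message; every other 
name here is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ColumnDescriptor:
    storage_type: str             # how values are stored, e.g. "INT64"
    logical_type: Optional[str]   # metadata annotation, e.g. "DECIMAL"
    encoded: bool                 # whether the column chunk is encoded


def vector_for(col: ColumnDescriptor) -> str:
    """Vector class allocated for the column, honoring the metadata."""
    if col.storage_type == "INT64" and col.logical_type == "DECIMAL":
        return "NullableDecimal18Vector"
    return "NullableBigIntVector"


def reader_target_suspected(col: ColumnDescriptor) -> str:
    """Vector class the reader expects, modeling the reported behavior."""
    if col.encoded:
        # Assumed bug: the encoded path looks only at the storage type.
        return "NullableBigIntVector"
    return vector_for(col)


def reader_target_consistent(col: ColumnDescriptor) -> str:
    """One possible remedy: always consult the metadata as well."""
    return vector_for(col)


def read(col: ColumnDescriptor, reader_target) -> str:
    """Fail the way a bad cast would when reader and vector disagree."""
    actual, expected = vector_for(col), reader_target(col)
    if actual != expected:
        raise TypeError(f"{actual} cannot be cast to {expected}")
    return actual
```

In this model, an encoded INT64/DECIMAL column fails with the same message 
as in the report, while the unencoded column reads fine. That would be 
consistent with very small files working.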

This is pretty severe, as compression now appears to be enabled by default.  
I also note that with only 1-2 records in the JSON file, the writer doesn't 
bother compressing and the queries then work fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
