Adam Gilmore created DRILL-2286:
-----------------------------------
Summary: Parquet compression causes read errors
Key: DRILL-2286
URL: https://issues.apache.org/jira/browse/DRILL-2286
Project: Apache Drill
Issue Type: Bug
Components: Storage - Parquet
Affects Versions: 0.8.0
Reporter: Adam Gilmore
Assignee: Steven Phillips
Priority: Critical
From what I can see, since compression has been added to the Parquet writer,
reading errors can occur.
Basically, types like timestamp and decimal are stored in Parquet as int64
with converted-type metadata. It appears that when the column chunk is
compressed, the reader tries to read the int64 values into a vector of the
timestamp/decimal type, which causes a cast error.
Here's the JSON file I'm using:
{ "a": 1.5 }
{ "a": 3.5 }
{ "a": 1.5 }
{ "a": 2.5 }
{ "a": 1.5 }
{ "a": 5.5 }
{ "a": 1.5 }
{ "a": 6.0 }
{ "a": 1.5 }
Now create a Parquet table like so:
create table dfs.tmp.test as (select cast(a as decimal(18,8)) from
dfs.tmp.`test.json`)
Now when you try to query it like so:
0: jdbc:drill:zk=local> select * from dfs.tmp.test;
Query failed: RemoteRpcException: Failure while running fragment.,
org.apache.drill.exec.vector.NullableDecimal18Vector cannot be cast to
org.apache.drill.exec.vector.NullableBigIntVector [
91e23d42-fa06-4429-b78e-3ff32352e660 on ...:31010 ]
[ 91e23d42-fa06-4429-b78e-3ff32352e660 on ...:31010 ]
Error: exception while executing query: Failure while executing query.
(state=,code=0)
The same failure occurs with timestamps, for example.
The relevant code is in ColumnReaderFactory: when the column chunk is
encoded, it creates specific readers based on the primitive type of the
column (in this case int64) rather than on the converted timestamp/decimal
type.
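For illustration, here is a minimal, self-contained Java sketch of that
pattern (stand-in classes, not Drill's real ones), showing why picking a
reader from the primitive type alone ends in a ClassCastException:

class ValueVector {}
class NullableBigIntVector extends ValueVector {}
class NullableDecimal18Vector extends ValueVector {}

public class CastFailureSketch {
    public static void main(String[] args) {
        // Drill allocates the output vector from the SQL type, so the
        // column's vector really is a decimal vector:
        ValueVector allocated = new NullableDecimal18Vector();

        // A reader chosen purely from the Parquet primitive type (INT64)
        // expects a big-int vector and casts accordingly:
        NullableBigIntVector target = (NullableBigIntVector) allocated;
        // -> java.lang.ClassCastException: NullableDecimal18Vector cannot
        //    be cast to NullableBigIntVector (mirrors the reported error)
    }
}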
This is pretty severe, as compression now appears to be enabled by default.
I do note that with only 1-2 records in the JSON file, the writer doesn't
bother compressing, and the queries then work fine.
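A possible workaround until this is fixed (assuming the
store.parquet.compression session option is available under this name in
this build) is to write the table with compression disabled:

alter session set `store.parquet.compression` = 'none';
create table dfs.tmp.test as (select cast(a as decimal(18,8)) from
dfs.tmp.`test.json`);

Tables written this way should read back fine, consistent with the
uncompressed small-file behaviour above.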