[
https://issues.apache.org/jira/browse/DRILL-2286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332485#comment-14332485
]
Adam Gilmore commented on DRILL-2286:
-------------------------------------
Right you are - a duplicate it is.
> Parquet compression causes read errors
> --------------------------------------
>
> Key: DRILL-2286
> URL: https://issues.apache.org/jira/browse/DRILL-2286
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Parquet
> Affects Versions: 0.8.0
> Reporter: Adam Gilmore
> Assignee: Steven Phillips
> Priority: Critical
>
> From what I can see, since compression was added to the Parquet writer,
> read errors can occur.
> Types such as timestamp and decimal are stored physically as int64 with
> converted-type metadata. It appears that when the column chunk is
> compressed, the reader tries to read the int64 values into a vector of the
> timestamp/decimal type, which causes a cast error.
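> To make the failure mode concrete, here is a minimal, self-contained Java
> sketch (hypothetical classes, not Drill's actual vector or reader
> implementations) of a reader keyed on the physical type (int64) writing into
> a vector that was allocated from the SQL type (decimal):
> {code}
> // Hypothetical stand-ins for Drill's value vectors.
> abstract class ValueVector {}
>
> class NullableBigIntVector extends ValueVector {
>     void setLong(int index, long value) { /* store raw int64 */ }
> }
>
> class NullableDecimal18Vector extends ValueVector {
>     void setUnscaled(int index, long unscaled) { /* store decimal(18,8) */ }
> }
>
> public class CastRepro {
>     // A reader chosen from the physical Parquet type (INT64) assumes a
>     // BigInt vector, but the vector was built from the SQL schema.
>     static void readInt64(ValueVector target, long raw) {
>         ((NullableBigIntVector) target).setLong(0, raw); // ClassCastException
>     }
>
>     public static void main(String[] args) {
>         // DECIMAL(18,8) column -> decimal vector, but an int64 reader:
>         readInt64(new NullableDecimal18Vector(), 150000000L);
>     }
> }
> {code}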
> Here's the JSON file I'm using:
> {code}
> { "a": 1.5 }
> { "a": 3.5 }
> { "a": 1.5 }
> { "a": 2.5 }
> { "a": 1.5 }
> { "a": 5.5 }
> { "a": 1.5 }
> { "a": 6.0 }
> { "a": 1.5 }
> {code}
> Now create a Parquet table like so:
> {code}
> create table dfs.tmp.test as (select cast(a as decimal(18,8)) from dfs.tmp.`test.json`)
> {code}
> Now when you try to query it like so:
> {noformat}
> 0: jdbc:drill:zk=local> select * from dfs.tmp.test;
> Query failed: RemoteRpcException: Failure while running fragment.,
> org.apache.drill.exec.vector.NullableDecimal18Vector cannot be cast to
> org.apache.drill.exec.vector.NullableBigIntVector [
> 91e23d42-fa06-4429-b78e-3ff32352e660 on ...:31010 ]
> [ 91e23d42-fa06-4429-b78e-3ff32352e660 on ...:31010 ]
> Error: exception while executing query: Failure while executing query.
> (state=,code=0)
> {noformat}
> The same happens for timestamps, for example.
> The relevant code is in ColumnReaderFactory: when the column chunk is
> encoded, it creates a reader based on the primitive type of the column
> (int64 in this case) rather than the converted timestamp/decimal type.
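> A rough sketch of the dispatch pattern being described (hypothetical types,
> not the real ColumnReaderFactory source): the reader is selected from the
> primitive type alone, so the converted-type metadata is dropped:
> {code}
> // Hypothetical enums standing in for Parquet's type metadata.
> enum PrimitiveType { INT32, INT64, BINARY }
> enum ConvertedType { NONE, DECIMAL, TIMESTAMP }
>
> interface ColumnReader { void readField(); }
>
> class ColumnReaderFactorySketch {
>     static ColumnReader create(PrimitiveType primitive, ConvertedType converted) {
>         switch (primitive) {
>             case INT64:
>                 // BUG (as described above): `converted` is never consulted,
>                 // so DECIMAL/TIMESTAMP columns get a reader that casts the
>                 // target vector to NullableBigIntVector.
>                 return () -> { /* fill a NullableBigIntVector */ };
>             default:
>                 throw new UnsupportedOperationException(primitive.name());
>         }
>     }
> }
> {code}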
> This is pretty severe, as compression now appears to be enabled by default.
> I do note that with only 1-2 records in the JSON file, the writer doesn't
> bother compressing and the queries then work fine.