[ 
https://issues.apache.org/jira/browse/DRILL-6209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert V updated DRILL-6209:
----------------------------
    Description: 
Querying Parquet files that were generated by Apache Spark 2.2.1 and contain a 
Timestamp column might fail when the 'store.parquet.reader.int96_as_timestamp' 
option is enabled.

Query that fails:
{code:sql}
ALTER SESSION SET `store.parquet.reader.int96_as_timestamp` = TRUE;

SELECT t.* FROM 
dfs.`/Workspace/Data/part-00000-3b8917e1-0bdb-4b34-90c5-1ca667e06767-c000.snappy.parquet`
 t;
{code}
Query that succeeds:
{code:sql}
ALTER SESSION SET `store.parquet.reader.int96_as_timestamp` = FALSE;

SELECT CONVERT_FROM(t.date_time, 'TIMESTAMP_IMPALA') AS ts, t.* FROM 
dfs.`/Workspace/Data/part-00000-3b8917e1-0bdb-4b34-90c5-1ca667e06767-c000.snappy.parquet`
 t;
{code}
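For reference, the TIMESTAMP_IMPALA conversion used in the workaround interprets each 12-byte INT96 value as 8 bytes of nanoseconds-within-day followed by a 4-byte Julian day number, both little-endian. A minimal standalone sketch of that decoding (illustrative only, not Drill's actual implementation):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;

public class Int96Decode {
    // Julian day number of the Unix epoch (1970-01-01).
    private static final long JULIAN_EPOCH_DAY = 2440588L;

    // Decode a 12-byte Parquet INT96 timestamp:
    // bytes 0-7:  nanoseconds within the day (little-endian)
    // bytes 8-11: Julian day number (little-endian)
    static Instant decodeInt96(byte[] raw) {
        ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buf.getLong();
        int julianDay = buf.getInt();
        long epochDay = julianDay - JULIAN_EPOCH_DAY;
        long epochSecond = epochDay * 86400L + nanosOfDay / 1_000_000_000L;
        long nanoAdjustment = nanosOfDay % 1_000_000_000L;
        return Instant.ofEpochSecond(epochSecond, nanoAdjustment);
    }

    public static void main(String[] args) {
        // Julian day 2440588 with zero nanos decodes to the Unix epoch.
        ByteBuffer buf = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN);
        buf.putLong(0L).putInt(2440588);
        System.out.println(decodeInt96(buf.array())); // 1970-01-01T00:00:00Z
    }
}
```

When the failing path with `int96_as_timestamp = TRUE` is used, Drill itself is expected to perform an equivalent conversion instead of handing the raw bytes to `CONVERT_FROM`.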
See logs attached for further details. The cause of the issue is:

 
{noformat}
Caused by: java.lang.ClassCastException: 
org.apache.drill.exec.vector.TimeStampVector cannot be cast to 
org.apache.drill.exec.vector.VarBinaryVector{noformat}

 I'm not able to upload sample Parquet files because they contain sensitive 
information.
 Parquet files generated by an Apache Spark job on AWS EMR failed; file sizes 
are in the range of hundreds of megabytes.
 Parquet files generated by a local Spark installation worked, however. They 
contained only a few rows, so it wasn't an accurate comparison with the larger 
data set.
  
 The bug is present in the current master branch (the 1.13.0 release candidate).
  
 This issue is related to DRILL-5097.
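On the writer side, a possible way to avoid INT96 timestamps altogether is noted below as an assumption: it requires Spark 2.3+, where the `spark.sql.parquet.outputTimestampType` option was introduced (Spark 2.2.1 always writes timestamps as INT96), so it would not help the exact setup reported here:

```java
// Assumption: Spark 2.3+ only; configuration fragment, not runnable on 2.2.1.
// Writes timestamps as annotated INT64 micros instead of legacy INT96.
spark.conf().set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS");
```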

  was:
Querying Parquet files that were generated by Apache Spark 2.2.1 and contain a 
Timestamp column might fail when the 'store.parquet.reader.int96_as_timestamp' 
option is enabled.

Query that fails:
{code:sql}
ALTER SESSION SET `store.parquet.reader.int96_as_timestamp` = TRUE;

SELECT t.* FROM 
dfs.`/Workspace/Data/part-00000-3b8917e1-0bdb-4b34-90c5-1ca667e06767-c000.snappy.parquet`
 t;
{code}


Query that succeeds:
{code:sql}
ALTER SESSION SET `store.parquet.reader.int96_as_timestamp` = FALSE;

SELECT CONVERT_FROM(t.date_time, 'TIMESTAMP_IMPALA') AS ts, t.* FROM 
dfs.`/Workspace/Data/part-00000-3b8917e1-0bdb-4b34-90c5-1ca667e06767-c000.snappy.parquet`
 t;
{code}


See logs attached.
I'm not able to upload sample Parquet files because they contain sensitive 
information.
Parquet files generated by an Apache Spark job on AWS EMR failed; file sizes 
are in the range of hundreds of megabytes.
Parquet files generated by a local Spark installation worked, however. They 
contained only a few rows, so it wasn't an accurate comparison with the larger 
data set.
 
The bug is present in the current master branch (the 1.13.0 release candidate).
 
This issue is related to 
[DRILL-5097|https://issues.apache.org/jira/browse/DRILL-5097]


> Spark generated Parquet file reading fails when 
> 'store.parquet.reader.int96_as_timestamp' is used
> -------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-6209
>                 URL: https://issues.apache.org/jira/browse/DRILL-6209
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.13.0
>         Environment: * Parquet files, that failed to query, were generated 
> using Apache Spark 2.2.1 on AWS EMR. The Spark SQL library was used.
>  * Drill was set up on a Mac OS El Capitan system, running Java 8.
>            Reporter: Robert V
>            Priority: Major
>         Attachments: error-stacktrace.txt, successful-log.txt
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
