[
https://issues.apache.org/jira/browse/DRILL-6209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robert V updated DRILL-6209:
----------------------------
Description:
Querying Parquet files that were generated by Apache Spark 2.2.1 and contain a Timestamp column may fail when the 'store.parquet.reader.int96_as_timestamp' option is enabled.
Query that fails:
{code:java}
ALTER SESSION SET `store.parquet.reader.int96_as_timestamp` = TRUE;
SELECT t.* FROM
dfs.`/Workspace/Data/part-00000-3b8917e1-0bdb-4b34-90c5-1ca667e06767-c000.snappy.parquet`
t;
{code}
Query that succeeds:
{code:java}
ALTER SESSION SET `store.parquet.reader.int96_as_timestamp` = FALSE;
SELECT CONVERT_FROM(t.date_time, 'TIMESTAMP_IMPALA') AS ts, t.* FROM
dfs.`/Workspace/Data/part-00000-3b8917e1-0bdb-4b34-90c5-1ca667e06767-c000.snappy.parquet`
t;
{code}
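For reference, the CONVERT_FROM(..., 'TIMESTAMP_IMPALA') workaround decodes the 12-byte INT96 value that Spark writes for timestamps: 8 little-endian bytes holding the nanoseconds within the day, followed by a 4-byte little-endian Julian day number. A minimal Python sketch of that decoding (an illustration of the layout only, not Drill's actual implementation):

```python
import struct
from datetime import datetime, timedelta, timezone

# Julian day number of the Unix epoch, 1970-01-01.
JULIAN_EPOCH_DAY = 2440588

def int96_to_datetime(raw: bytes) -> datetime:
    """Decode a 12-byte Parquet INT96 timestamp (Impala/Spark layout):
    bytes 0-7  = little-endian nanoseconds within the day,
    bytes 8-11 = little-endian Julian day number."""
    nanos_of_day, julian_day = struct.unpack('<qi', raw)
    days_since_epoch = julian_day - JULIAN_EPOCH_DAY
    return (datetime(1970, 1, 1, tzinfo=timezone.utc)
            + timedelta(days=days_since_epoch,
                        microseconds=nanos_of_day // 1000))

# Julian day 2458120 with 0 ns of day is 2018-01-01 00:00:00 UTC.
print(int96_to_datetime(struct.pack('<qi', 0, 2458120)))
```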
See the attached logs for further details. The root cause is:
{noformat}
Caused by: java.lang.ClassCastException:
org.apache.drill.exec.vector.TimeStampVector cannot be cast to
org.apache.drill.exec.vector.VarBinaryVector{noformat}
I'm not able to upload sample Parquet files because they contain sensitive information.
The failing Parquet files were generated by an Apache Spark job on AWS EMR; their sizes are in the range of hundreds of megabytes.
Parquet files generated by a local Spark installation worked, however. They contained only a few rows, so it wasn't an accurate comparison with the larger data set.
The bug is present in the current master branch (1.13.0 candidate version).
This issue is related to DRILL-5097.
> Spark generated Parquet file reading fails when
> 'store.parquet.reader.int96_as_timestamp' is used
> -------------------------------------------------------------------------------------------------
>
> Key: DRILL-6209
> URL: https://issues.apache.org/jira/browse/DRILL-6209
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Parquet
> Affects Versions: 1.13.0
> Environment: * Parquet files, that failed to query, were generated
> using Apache Spark 2.2.1 on AWS EMR. The Spark SQL library was used.
> * Drill was set up on a Mac OS El Capitan system, running Java 8.
> Reporter: Robert V
> Priority: Major
> Attachments: error-stacktrace.txt, successful-log.txt
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)