Robert V created DRILL-6209:
-------------------------------
Summary: Spark generated Parquet file reading fails when
'store.parquet.reader.int96_as_timestamp' is used
Key: DRILL-6209
URL: https://issues.apache.org/jira/browse/DRILL-6209
Project: Apache Drill
Issue Type: Bug
Components: Storage - Parquet
Affects Versions: 1.13.0
Environment: * Parquet files that failed to query were generated
using Apache Spark 2.2.1 on AWS EMR. The Spark SQL library was used.
* Drill was set up on a Mac OS El Capitan system, running Java 8.
Reporter: Robert V
Attachments: error-stacktrace.txt, successful-log.txt
Queries against Parquet files with a Timestamp column, generated using Apache
Spark 2.2.1, might fail when the 'store.parquet.reader.int96_as_timestamp'
option is enabled.
Query that fails:
{code:java}
ALTER SESSION SET `store.parquet.reader.int96_as_timestamp` = TRUE;
SELECT t.* FROM
dfs.`/Workspace/Data/part-00000-3b8917e1-0bdb-4b34-90c5-1ca667e06767-c000.snappy.parquet`
t;
{code}
Query that succeeds:
{code:java}
ALTER SESSION SET `store.parquet.reader.int96_as_timestamp` = FALSE;
SELECT CONVERT_FROM(t.date_time, 'TIMESTAMP_IMPALA') AS ts, t.* FROM
dfs.`/Workspace/Data/part-00000-3b8917e1-0bdb-4b34-90c5-1ca667e06767-c000.snappy.parquet`
t;
{code}
See logs attached.
I'm not able to upload sample Parquet files because they contain sensitive
information.
Parquet files generated by an Apache Spark job on AWS EMR failed to read. File
sizes are in the range of hundreds of megabytes.
Parquet files generated by a local Spark installation worked, however. They
contained only a few rows, so it wasn't an accurate comparison with the larger
data set.
The bug is present in the current master branch (1.13.0 candidate version).
This issue is related to
[DRILL-5097|https://issues.apache.org/jira/browse/DRILL-5097]
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)