[
https://issues.apache.org/jira/browse/SPARK-15374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15288078#comment-15288078
]
Euan de Kock commented on SPARK-15374:
--------------------------------------
Sample script to replicate this error (the last statement fails with the NPE).
First, build an example table:
{quote}
CREATE EXTERNAL TABLE ALL_DATA(
tag_tm TIMESTAMP,
tag_val STRING,
tag_confidence string,
minval string,
maxval string
)
STORED AS PARQUET
LOCATION 'hdfs:///user/hive/warehouse/all_data/'
TBLPROPERTIES ("parquet.compress" = "GZIP");
{quote}
Populate this with a single row of non-null data:
{quote}
INSERT INTO all_data (tag_tm, tag_val, tag_confidence, minval, maxval)
VALUES('2016-05-18 09:03:00', '100.00', '100', '99.00', '101.00');
{quote}
Verify the row is readable:
{quote}
select * from all_data;
{quote}
Now create a second table to hold some NULL columns:
{quote}
CREATE EXTERNAL TABLE ALL_DATA_HIVE_NULL(
tag_tm TIMESTAMP,
tag_val STRING,
tag_confidence string,
minval string,
maxval string
)
STORED AS PARQUET
LOCATION 'hdfs:///user/hive/warehouse/all_data_hive_null/'
TBLPROPERTIES ("parquet.compress" = "GZIP");
{quote}
Populate it, writing NULL into the last two columns:
{quote}
insert overwrite table all_data_hive_null
select tag_tm, tag_val, tag_confidence, null, null from all_data;
{quote}
Confirm the NULLs were stored:
{quote}
select * from all_data_hive_null;
{quote}
Create an equivalent dataset from within the spark-shell by directly copying the
new table:
{quote}
{color:green}
// spark-shell commands
val df = sqlContext.sql("select tag_tm, tag_val, tag_confidence, minval, maxval from parquet.`hdfs:///user/hive/warehouse/all_data_hive_null/`")
df.show
df.write.mode("overwrite").parquet("hdfs:///user/hive/warehouse/all_data_spark_null")
{color}
{quote}
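A possible mitigation to try, sketched below, is asking Spark to write Hive-compatible "legacy" Parquet before repeating the copy. The {{spark.sql.parquet.writeLegacyFormat}} setting exists in Spark 1.6, but whether it avoids this particular NPE is an assumption, not a confirmed fix:
{quote}
{color:green}
// spark-shell: hypothetical mitigation sketch (assumption, not a confirmed fix for this NPE)
sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
val df = sqlContext.sql("select tag_tm, tag_val, tag_confidence, minval, maxval from parquet.`hdfs:///user/hive/warehouse/all_data_hive_null/`")
df.write.mode("overwrite").parquet("hdfs:///user/hive/warehouse/all_data_spark_null")
{color}
{quote}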
Reference this new table from Hive:
{quote}
CREATE EXTERNAL TABLE ALL_DATA_SPARK_NULL(
tag_tm TIMESTAMP,
tag_val STRING,
tag_confidence string,
minval string,
maxval string
)
STORED AS PARQUET
LOCATION 'hdfs:///user/hive/warehouse/all_data_spark_null/'
TBLPROPERTIES ("parquet.compress" = "GZIP");
{quote}
And try to select the data:
{quote}
select * from all_data_spark_null;
{quote}
The error message from Hive via JDBC is:
{quote}
An error occurred when executing the SQL command:
select * from all_data_spark_null
[Amazon][HiveJDBCDriver](500312) Error in fetching data rows:
org.apache.hive.service.cli.HiveSQLException:java.io.IOException:
java.lang.NullPointerException:26:25; [SQL State=HY000, DB Errorcode=500312]
Execution time: 0.9s
1 statement failed.
{quote}
And from the Hive command line:
{quote}
hive> select * from all_data_spark_null;
OK
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Failed with exception java.io.IOException:java.lang.NullPointerException
Time taken: 0.761 seconds
hive>
{quote}
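To see what actually differs between the two files, the Parquet footers that Hive and Spark wrote can be compared with the parquet-tools utility. This is only a sketch: it assumes parquet-tools is available on the cluster, and the <part-file> names are placeholders for the actual data files under each location:
{quote}
# Compare the schema and column metadata of a Hive-written vs a Spark-written file
hadoop jar parquet-tools.jar schema hdfs:///user/hive/warehouse/all_data_hive_null/<part-file>
hadoop jar parquet-tools.jar schema hdfs:///user/hive/warehouse/all_data_spark_null/<part-file>
hadoop jar parquet-tools.jar meta hdfs:///user/hive/warehouse/all_data_spark_null/<part-file>
{quote}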
> Spark created Parquet files cause NPE when a column has only NULL values
> ------------------------------------------------------------------------
>
> Key: SPARK-15374
> URL: https://issues.apache.org/jira/browse/SPARK-15374
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.6.0, 1.6.1
> Environment: AWS EMR running Spark 1.6.1
> Reporter: Euan de Kock
>
> When an external table is built from Spark, and is subsequently accessed by
> Hive it will generate an NPE error if one of the columns contains only null
> values. Spark (and Presto) can successfully read this data, but Hive cannot.
> If the same dataset is created by Hive, it is readable by all systems.