[ https://issues.apache.org/jira/browse/SPARK-15374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15288078#comment-15288078 ]

Euan de Kock commented on SPARK-15374:
--------------------------------------

Sample script to replicate this error (the last line fails with the NPE):

Build an example table:
{quote}
CREATE EXTERNAL TABLE ALL_DATA(
tag_tm TIMESTAMP,
tag_val STRING,
tag_confidence STRING,
minval STRING,
maxval STRING
)
STORED AS PARQUET
LOCATION 'hdfs:///user/hive/warehouse/all_data/'
TBLPROPERTIES ("parquet.compress" = "GZIP");
{quote}
Populate it with a single row of non-null data:
{quote}
INSERT INTO all_data (tag_tm, tag_val, tag_confidence, minval, maxval) 
VALUES('2016-05-18 09:03:00', '100.00', '100', '99.00', '101.00');
{quote}

Verify the row:
{quote}
select * from all_data;
{quote}

Now create a table to hold some NULL columns:
{quote}
CREATE EXTERNAL TABLE ALL_DATA_HIVE_NULL(
tag_tm TIMESTAMP,
tag_val STRING,
tag_confidence STRING,
minval STRING,
maxval STRING
)
STORED AS PARQUET
LOCATION 'hdfs:///user/hive/warehouse/all_data_hive_null/'
TBLPROPERTIES ("parquet.compress" = "GZIP");
{quote}

Populate it, writing NULLs into the last two columns:
{quote}
insert overwrite table all_data_hive_null
select tag_tm, tag_val, tag_confidence, null, null from all_data;
{quote}

Verify the NULL columns:
{quote}
select * from all_data_hive_null;
{quote}

Create an equivalent set of Parquet files from within the Spark shell by copying
the new table directly:
{quote}
// spark-shell commands
val df = sqlContext.sql("select tag_tm, tag_val, tag_confidence, minval, maxval from parquet.`hdfs:///user/hive/warehouse/all_data_hive_null/`")
df.show
df.write.mode("overwrite").parquet("hdfs:///user/hive/warehouse/all_data_spark_null")
{quote}
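Before writing, it may also be worth confirming the schema Spark inferred from the Hive-written files ({{printSchema}} is standard DataFrame API; the expectation below is my assumption, not verified output):

{quote}
// Confirm the schema Spark read from the Hive-written Parquet files.
// minval/maxval should report as string (nullable = true) even though
// every value in those columns is null.
df.printSchema
{quote}

If the read-side schema already shows concrete string types, the difference Hive trips over is presumably introduced on the write side.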

Reference the Spark-written files from Hive:
{quote}
CREATE EXTERNAL TABLE ALL_DATA_SPARK_NULL(
tag_tm TIMESTAMP,
tag_val STRING,
tag_confidence STRING,
minval STRING,
maxval STRING
)
STORED AS PARQUET
LOCATION 'hdfs:///user/hive/warehouse/all_data_spark_null/'
TBLPROPERTIES ("parquet.compress" = "GZIP");
{quote}

And try to select the data:
{quote}
select * from all_data_spark_null; 
{quote}

The error message from Hive via JDBC is:
{quote}
An error occurred when executing the SQL command:
select * from all_data_spark_null

[Amazon][HiveJDBCDriver](500312) Error in fetching data rows:
org.apache.hive.service.cli.HiveSQLException:java.io.IOException:
java.lang.NullPointerException:26:25; [SQL State=HY000, DB Errorcode=500312]

Execution time: 0.9s

1 statement failed.
{quote}

And from the Hive command line:

{quote}
hive> select * from all_data_spark_null;
OK
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Failed with exception java.io.IOException:java.lang.NullPointerException
Time taken: 0.761 seconds
hive>
{quote}
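To narrow down what the two writers do differently, it may help to compare the file-level Parquet schemas of the Hive-written and Spark-written files (a diagnostic suggestion on my part, assuming a parquet-tools jar is available on the cluster; the file names below are placeholders):

{quote}
# Dump the Parquet schema of one data file from each location and diff them.
# Replace <file>.parquet with an actual file name from each directory.
hadoop jar parquet-tools.jar schema hdfs:///user/hive/warehouse/all_data_hive_null/<file>.parquet
hadoop jar parquet-tools.jar schema hdfs:///user/hive/warehouse/all_data_spark_null/<file>.parquet
{quote}

Any difference in the declared type or repetition level of the all-null columns would point at what Hive's Parquet reader fails to handle.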

> Spark created Parquet files cause NPE when a column has only NULL values
> ------------------------------------------------------------------------
>
>                 Key: SPARK-15374
>                 URL: https://issues.apache.org/jira/browse/SPARK-15374
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0, 1.6.1
>         Environment: AWS EMR running Spark 1.6.1
>            Reporter: Euan de Kock
>
> When an external table is built from Spark, and is subsequently accessed by 
> Hive it will generate an NPE error if one of the columns contains only null 
> values. Spark (and Presto) can successfully read this data, but Hive cannot. 
> If the same dataset is created by Hive, it is readable by all systems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
