Oscar Cassetti created SPARK-27695:
--------------------------------------
Summary: SELECT * returns null column when reading from Hive / ORC
and spark.sql.hive.convertMetastoreOrc=true
Key: SPARK-27695
URL: https://issues.apache.org/jira/browse/SPARK-27695
Project: Spark
Issue Type: Bug
Components: Input/Output, Spark Core
Affects Versions: 2.3.3, 2.3.2, 2.3.1
Reporter: Oscar Cassetti
If you run
{code:java}
select * from hive.some_table{code}
and the underlying ORC data does not exactly match the Hive table schema, the
last column is returned as null.
Example:
{code:java}
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.types import *
conf = SparkConf().set('spark.sql.hive.convertMetastoreOrc', 'true')
spark = SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate()
data = [{'a': i, 'b': i + 10, 'd': {'a': i, 'b': i + 10}} for i in range(1, 100)]
data_schema = StructType([
    StructField('a', LongType(), True),
    StructField('b', LongType(), True),
    StructField('d', MapType(StringType(), LongType(), True), True),
])
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(data_schema)
df.write.format("orc").save("./sample_data/")
spark.sql("""create external table tmp(
a bigint,
b bigint,
d map<string, bigint>)
stored as orc
location 'sample_data/'
""")
spark.sql("select * from tmp").show()
{code}
This returns the correct result:
{noformat}
+---+---+-------------------+
| a| b| d|
+---+---+-------------------+
| 85| 95| [a -> 85, b -> 95]|
| 86| 96| [a -> 86, b -> 96]|
| 87| 97| [a -> 87, b -> 97]|
| 88| 98| [a -> 88, b -> 98]|
| 89| 99| [a -> 89, b -> 99]|
| 90|100|[a -> 90, b -> 100]|
{noformat}
However, if we add a new column to the underlying data without altering the
Hive schema, the last column of the Hive schema is returned as null:
{code}
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.types import *
conf = SparkConf().set('spark.sql.hive.convertMetastoreOrc', 'true')
spark = SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate()
data = [{'a': i, 'b': i + 10, 'c': i + 5, 'd': {'a': i, 'b': i + 10, 'c': i + 5}}
        for i in range(1, 100)]
data_schema = StructType([
    StructField('a', LongType(), True),
    StructField('b', LongType(), True),
    StructField('c', LongType(), True),
    StructField('d', MapType(StringType(), LongType(), True), True),
])
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(data_schema)
df.write.format("orc").mode("overwrite").save("./sample_data/")
spark.sql("select * from tmp").show()
spark.read.orc("./sample_data/").show()
{code}
The first {{show()}} (reading through the Hive table) returns:
{noformat}
+---+---+----+
| a| b| d|
+---+---+----+
| 85| 95|null|
| 86| 96|null|
| 87| 97|null|
| 88| 98|null|
| 89| 99|null|
| 90|100|null|
| 91|101|null|
| 92|102|null|
| 93|103|null|
| 94|104|null|
{noformat}
But the data on disk is correct, as shown by reading the ORC files directly:
{noformat}
+---+---+---+--------------------+
| a| b| c| d|
+---+---+---+--------------------+
| 85| 95| 90|[a -> 85, b -> 95...|
| 86| 96| 91|[a -> 86, b -> 96...|
| 87| 97| 92|[a -> 87, b -> 97...|
| 88| 98| 93|[a -> 88, b -> 98...|
| 89| 99| 94|[a -> 89, b -> 99...|
| 90|100| 95|[a -> 90, b -> 10...|
| 91|101| 96|[a -> 91, b -> 10...|
{noformat}
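A plausible explanation (my reading of the symptom, not confirmed against the Spark source): with {{spark.sql.hive.convertMetastoreOrc=true}} the native ORC reader appears to match the metastore schema to the file schema by position rather than by name. Once the file gains the extra column {{c}}, the table's third column {{d}} is read from the file's third physical column {{c}} (a {{bigint}}, incompatible with {{map<string,bigint>}}), which comes back as null, while {{a}} and {{b}} still line up. A plain-Python sketch of that misalignment (no Spark needed; the schemas below mirror the repro above):

```python
# Illustration only: how positional column resolution can misalign a narrower
# table schema against a wider file schema, while by-name resolution does not.
file_schema = ["a", "b", "c", "d"]   # columns actually present in the ORC files
table_schema = ["a", "b", "d"]       # columns declared in the Hive metastore

# By-position resolution: table column i reads file column i.
positional = {t: file_schema[i] for i, t in enumerate(table_schema)}
# By-name resolution: each table column reads the file column with the same name.
by_name = {t: t for t in table_schema if t in file_schema}

print(positional)  # {'a': 'a', 'b': 'b', 'd': 'c'}  <- 'd' reads the wrong column
print(by_name)     # {'a': 'a', 'b': 'b', 'd': 'd'}
```

If positional matching is indeed the cause, reading the table with {{spark.sql.hive.convertMetastoreOrc=false}} (the Hive SerDe path), or reading the files directly with an explicit schema that includes {{c}}, would be worth trying as a workaround.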
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)