Nithin created SPARK-31770:
------------------------------
Summary: spark double count issue when table is recreated from hive
Key: SPARK-31770
URL: https://issues.apache.org/jira/browse/SPARK-31770
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.3.2
Reporter: Nithin
Spark sets serde property named path when a table is created. Along with it ,
it will add a property called sql.source.provider that ensures the consistency.
But if we recreate the same table using hive , then the hive will not respect
the serde properties set by spark named sql.source.provider. This will lead to
conflict if we read the newly created table using spark .
It can be argued that the issue is with hive because it do not properly ensures
that the table properties are being copied. But these are the properties set by
spark on a hive table. and spark should be the one adhering to the standards of
hive. From a corporate standpoint , we cannot tell users to avoid creating or
altering tables on hive because spark will end up giving wrong results. *Worst
of all , even if it failed then it was okay. but it simply gives double count
without even a failure message and that is very dangerous for critical
applications*. Either end up giving error message if the serde property does
not have path as well as sql.source.provider (or) do not even try to create the
property path in first place. Few related issues :
# SPARK-31751
# SPARK-28266 was resolved as non-reproducible which is not entirely true
*steps to reproduce :*
{code:java}
df = spark.createDataFrame([{"a": "x", "b": "y", "c": "3"}])
df.createOrReplaceTempView("test1")
spark.sql("CREATE TABLE test1_using_orc USING ORC AS (SELECT * from test1)")
{code}
----
Run the below from hive ( preferably tez )
{code:java}
create table test2_like_test1 like test1_using_orc;
insert into test2_like_test1 select * from test1_using_orc;{code}
-------from spark again : Now test2_like_test1 will have serde property called
path , but will not have sql.source.provider.
{code:java}
>>> spark.sql("set spark.sql.hive.convertMetastoreOrc=false").show()
+--------------------+-----+
| key|value|
+--------------------+-----+
|spark.sql.hive.co...|false|
+--------------------+-----+
>>> spark.sql("select count(*) from test2_like_test1").show()
+--------+
|count(1)|
+--------+
| 1|
+--------+
>>> spark.sql("set spark.sql.hive.convertMetastoreOrc=true").show()
+--------------------+-----+
| key|value|
+--------------------+-----+
|spark.sql.hive.co...| true|
+--------------------+-----+
>>> spark.sql("select count(*) from test2_like_test1").show()
+--------+
|count(1)|
+--------+
| 2|
+--------+
{code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]