[jira] [Created] (SPARK-31770) spark double count issue when table is recreated from hive

Nithin (Jira) Wed, 20 May 2020 02:06:09 -0700

Nithin created SPARK-31770:
------------------------------

             Summary: spark double count issue when table is recreated from hive
                 Key: SPARK-31770
                 URL: https://issues.apache.org/jira/browse/SPARK-31770
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.3.2
            Reporter: Nithin



Spark sets a serde property named path when it creates a table. Along with it, 
Spark adds a property called sql.source.provider that ensures consistency. 
But if the same table is recreated from Hive, Hive does not carry over the 
sql.source.provider property set by Spark. This leads to a conflict when the 
newly created table is read from Spark. 

 

It can be argued that the issue is with Hive, because it does not ensure that 
the table properties are copied. But these are properties set by Spark on a 
Hive table, and Spark should be the one adhering to Hive's standards. From a 
corporate standpoint, we cannot tell users to avoid creating or altering tables 
in Hive because Spark will end up giving wrong results. *Worst of all, a failure 
would have been acceptable, but Spark silently returns a double count without 
any error message, and that is very dangerous for critical applications*. Spark 
should either raise an error when the serde properties contain path without 
sql.source.provider, or not create the path property in the first place. A few 
related issues: 
 # SPARK-31751
 # SPARK-28266 was resolved as non-reproducible which is not entirely true
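
The "raise an error" option proposed above could look roughly like the following. This is only a minimal sketch in plain Python, not Spark's actual code: validate_properties and InconsistentTablePropertiesError are hypothetical names, the property keys are spelled as in this report, and it assumes all properties are available as one flat dict.

```python
# Hypothetical sketch of the proposed check; not part of Spark.
# Property names are spelled as in this report and may differ in practice.
PATH_KEY = "path"
PROVIDER_KEY = "sql.source.provider"


class InconsistentTablePropertiesError(ValueError):
    """Raised when 'path' is set but the provider property is missing."""


def validate_properties(props):
    """Fail loudly if a table carries 'path' without the provider property.

    `props` is assumed to be a flat dict of the table's properties.
    Tables with neither property (plain Hive tables) pass unchanged.
    """
    if PATH_KEY in props and PROVIDER_KEY not in props:
        raise InconsistentTablePropertiesError(
            f"table has '{PATH_KEY}' but no '{PROVIDER_KEY}'; "
            "refusing to read it to avoid wrong results"
        )
    return props
```

With such a check, reading a table in the inconsistent state would stop with an error instead of silently returning a wrong count.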

 

*Steps to reproduce:*

 
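{code:python}
# Run in the pyspark shell, where `spark` is the active SparkSession.
df = spark.createDataFrame([{"a": "x", "b": "y", "c": "3"}])
df.createOrReplaceTempView("test1")
# Create an ORC table through Spark, so Spark writes its own properties.
spark.sql("CREATE TABLE test1_using_orc USING ORC AS (SELECT * FROM test1)")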
{code}
 
----
Run the following from Hive (preferably on Tez):

 
{code:sql}
create table test2_like_test1 like test1_using_orc;
insert into test2_like_test1 select * from test1_using_orc;{code}

----
From Spark again: test2_like_test1 now has the serde property named path, but 
not sql.source.provider. The count is correct (1) with 
spark.sql.hive.convertMetastoreOrc=false and doubles (2) when it is set to true. 

 

 
{code:python}
>>> spark.sql("set spark.sql.hive.convertMetastoreOrc=false").show()
+--------------------+-----+
|                 key|value|
+--------------------+-----+
|spark.sql.hive.co...|false|
+--------------------+-----+
>>> spark.sql("select count(*) from test2_like_test1").show()
+--------+
|count(1)|
+--------+
|       1|
+--------+
>>> spark.sql("set spark.sql.hive.convertMetastoreOrc=true").show()
+--------------------+-----+
|                 key|value|
+--------------------+-----+
|spark.sql.hive.co...| true|
+--------------------+-----+
>>> spark.sql("select count(*) from test2_like_test1").show()
+--------+
|count(1)|
+--------+
|       2|
+--------+
{code}
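
Until this is fixed, the comparison above can be automated as a guard for critical jobs. The following is a rough sketch only: count_under_both_settings and assert_counts_agree are hypothetical helper names, and `spark` is assumed to be an active SparkSession (anything exposing sql(...).collect() would work).

```python
# Hypothetical helpers; `spark` is assumed to be an active SparkSession.
CONVERT_KEY = "spark.sql.hive.convertMetastoreOrc"


def count_under_both_settings(spark, table):
    """Return the row count of `table` with ORC conversion off and on."""
    counts = {}
    for flag in ("false", "true"):
        spark.sql(f"SET {CONVERT_KEY}={flag}")
        row = spark.sql(f"SELECT COUNT(*) FROM {table}").collect()[0]
        counts[flag] = row[0]
    return counts


def assert_counts_agree(spark, table):
    """Raise if the two read paths disagree, instead of trusting either."""
    counts = count_under_both_settings(spark, table)
    if counts["false"] != counts["true"]:
        raise AssertionError(
            f"{table}: count is {counts['false']} with {CONVERT_KEY}=false "
            f"but {counts['true']} with {CONVERT_KEY}=true"
        )
    return counts["true"]
```

Calling assert_counts_agree(spark, "test2_like_test1") on the table from the steps above would be expected to raise, since the two settings return 1 and 2. Note that the helper leaves the setting at true when it returns.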
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

