HyukjinKwon commented on a change in pull request #25090:
[SPARK-28278][SQL][PYTHON][TESTS] Convert and port 'except-all.sql' into UDF
test base
URL: https://github.com/apache/spark/pull/25090#discussion_r301880585
##########
File path: sql/core/src/test/resources/sql-tests/inputs/udf/udf-except-all.sql
##########
@@ -0,0 +1,164 @@
+-- This test file was converted from except-all.sql.
+-- Note that currently registered UDF returns a string. So there are some
differences, for instance
+-- in string cast within UDF in Scala and Python.
+
+CREATE TEMPORARY VIEW tab1 AS SELECT * FROM VALUES
+ (0), (1), (2), (2), (2), (2), (3), (null), (null) AS tab1(c1);
+CREATE TEMPORARY VIEW tab2 AS SELECT * FROM VALUES
+ (1), (2), (2), (3), (5), (5), (null) AS tab2(c1);
+CREATE TEMPORARY VIEW tab3 AS SELECT * FROM VALUES
+ (1, 2),
+ (1, 2),
+ (1, 3),
+ (2, 3),
+ (2, 2)
+ AS tab3(k, v);
+CREATE TEMPORARY VIEW tab4 AS SELECT * FROM VALUES
+ (1, 2),
+ (2, 3),
+ (2, 2),
+ (2, 2),
+ (2, 20)
+ AS tab4(k, v);
+
+-- Basic EXCEPT ALL
+SELECT * FROM tab1
+EXCEPT ALL
+SELECT * FROM tab2;
+
+-- MINUS ALL (synonym for EXCEPT)
+SELECT * FROM tab1
+MINUS ALL
+SELECT * FROM tab2;
+
+-- EXCEPT ALL same table in both branches
+SELECT * FROM tab1
+EXCEPT ALL
+SELECT * FROM tab2 WHERE udf(c1) IS NOT NULL;
+
+-- Empty left relation
+SELECT * FROM tab1 WHERE udf(c1) > 5
+EXCEPT ALL
+SELECT * FROM tab2;
+
+-- Empty right relation
+SELECT * FROM tab1
+EXCEPT ALL
+SELECT * FROM tab2 WHERE c1 > udf(6);
+
+-- Type Coerced ExceptAll
+SELECT * FROM tab1
+EXCEPT ALL
+SELECT CAST(udf(1) AS BIGINT);
+
+-- Error as types of two side are not compatible
+SELECT * FROM tab1
+EXCEPT ALL
+SELECT array(1);
+
+-- Basic
+SELECT * FROM tab3
+EXCEPT ALL
+SELECT * FROM tab4;
+
+-- Basic
+SELECT * FROM tab4
+EXCEPT ALL
+SELECT * FROM tab3;
+
+-- EXCEPT ALL + INTERSECT
+SELECT * FROM tab4
+EXCEPT ALL
+SELECT * FROM tab3
+INTERSECT DISTINCT
+SELECT * FROM tab4;
+
+-- EXCEPT ALL + EXCEPT
+SELECT * FROM tab4
+EXCEPT ALL
+SELECT * FROM tab3
+EXCEPT DISTINCT
+SELECT * FROM tab4;
+
+-- Chain of set operations
+SELECT * FROM tab3
+EXCEPT ALL
+SELECT * FROM tab4
+UNION ALL
+SELECT * FROM tab3
+EXCEPT DISTINCT
+SELECT * FROM tab4;
+
+-- Mismatch on number of columns across both branches
+SELECT k FROM tab3
+EXCEPT ALL
+SELECT k, v FROM tab4;
+
+-- Chain of set operations
+SELECT * FROM tab3
+EXCEPT ALL
+SELECT * FROM tab4
+UNION
+SELECT * FROM tab3
+EXCEPT DISTINCT
+SELECT * FROM tab4;
+
+-- Using MINUS ALL
+SELECT * FROM tab3
+MINUS ALL
+SELECT * FROM tab4
+UNION
+SELECT * FROM tab3
+MINUS DISTINCT
+SELECT * FROM tab4;
+
+-- Chain of set operations
+SELECT * FROM tab3
+EXCEPT ALL
+SELECT * FROM tab4
+EXCEPT DISTINCT
+SELECT * FROM tab3
+EXCEPT DISTINCT
+SELECT * FROM tab4;
+
+-- Join under except all. Should produce empty resultset since both left and
right sets
+-- are same.
+SELECT *
+FROM (SELECT udf(tab3.k),
+ udf(tab4.v)
+ FROM tab3
+ JOIN tab4
+ ON tab3.k = tab4.k)
Review comment:
Hm, sounds possibly this is a bug. Let's create a JIRA.
BTW, if possible, it would be better to create a JIRA with a minimised and
narrowed-down reproducer in its JIRA description - actually this is one of the
key points of why we're doing this :D.
For instance,
```python
from pyspark.sql.functions import pandas_udf, PandasUDFType
@pandas_udf("string", PandasUDFType.SCALAR)
def noop(x):
return x + 1
spark.udf.register("udf", noop)
spark.sql("CREATE TEMPORARY VIEW ...")
spark.sql("...udf(...)...").show()
```
_If possible_, It might be even better if we can reproduce it via Python
native APIs as well since it'd be very likely reproducible with Python API
itself. For instance,
```python
from pyspark.sql.functions import pandas_udf, PandasUDFType
@pandas_udf("string", PandasUDFType.SCALAR)
def noop(x):
return x + 1
df1 = ...
df2 = ...
df1.join(df2 ...).show()
```
It might be even better to show that it works in Scala API with a minimised
reproducer in the JIRA description. That will make other contributors and
committers can easily focus on bug-fixing itself alone.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]