[GitHub] [spark] HyukjinKwon commented on a change in pull request #25090: [SPARK-28278][SQL][PYTHON][TESTS] Convert and port 'except-all.sql' into UDF test base

GitBox Tue, 09 Jul 2019 21:22:26 -0700

HyukjinKwon commented on a change in pull request #25090: 
[SPARK-28278][SQL][PYTHON][TESTS] Convert and port 'except-all.sql' into UDF 
test base
URL: https://github.com/apache/spark/pull/25090#discussion_r301880585


 ##########
 File path: sql/core/src/test/resources/sql-tests/inputs/udf/udf-except-all.sql
 ##########
 @@ -0,0 +1,164 @@
+-- This test file was converted from except-all.sql.
+-- Note that currently registered UDF returns a string. So there are some 
differences, for instance
+-- in string cast within UDF in Scala and Python.
+
+CREATE TEMPORARY VIEW tab1 AS SELECT * FROM VALUES
+    (0), (1), (2), (2), (2), (2), (3), (null), (null) AS tab1(c1);
+CREATE TEMPORARY VIEW tab2 AS SELECT * FROM VALUES
+    (1), (2), (2), (3), (5), (5), (null) AS tab2(c1);
+CREATE TEMPORARY VIEW tab3 AS SELECT * FROM VALUES
+    (1, 2), 
+    (1, 2),
+    (1, 3),
+    (2, 3),
+    (2, 2)
+    AS tab3(k, v);
+CREATE TEMPORARY VIEW tab4 AS SELECT * FROM VALUES
+    (1, 2), 
+    (2, 3),
+    (2, 2),
+    (2, 2),
+    (2, 20)
+    AS tab4(k, v);
+
+-- Basic EXCEPT ALL
+SELECT * FROM tab1
+EXCEPT ALL
+SELECT * FROM tab2;
+
+-- MINUS ALL (synonym for EXCEPT)
+SELECT * FROM tab1
+MINUS ALL
+SELECT * FROM tab2;
+
+-- EXCEPT ALL same table in both branches
+SELECT * FROM tab1
+EXCEPT ALL
+SELECT * FROM tab2 WHERE udf(c1) IS NOT NULL;
+
+-- Empty left relation
+SELECT * FROM tab1 WHERE udf(c1) > 5
+EXCEPT ALL
+SELECT * FROM tab2;
+
+-- Empty right relation
+SELECT * FROM tab1
+EXCEPT ALL
+SELECT * FROM tab2 WHERE c1 > udf(6);
+
+-- Type Coerced ExceptAll
+SELECT * FROM tab1
+EXCEPT ALL
+SELECT CAST(udf(1) AS BIGINT);
+
+-- Error as types of two side are not compatible
+SELECT * FROM tab1
+EXCEPT ALL
+SELECT array(1);
+
+-- Basic
+SELECT * FROM tab3
+EXCEPT ALL
+SELECT * FROM tab4;
+
+-- Basic
+SELECT * FROM tab4
+EXCEPT ALL
+SELECT * FROM tab3;
+
+-- EXCEPT ALL + INTERSECT
+SELECT * FROM tab4
+EXCEPT ALL
+SELECT * FROM tab3
+INTERSECT DISTINCT
+SELECT * FROM tab4;
+
+-- EXCEPT ALL + EXCEPT
+SELECT * FROM tab4
+EXCEPT ALL
+SELECT * FROM tab3
+EXCEPT DISTINCT
+SELECT * FROM tab4;
+
+-- Chain of set operations
+SELECT * FROM tab3
+EXCEPT ALL
+SELECT * FROM tab4
+UNION ALL
+SELECT * FROM tab3
+EXCEPT DISTINCT
+SELECT * FROM tab4;
+
+-- Mismatch on number of columns across both branches
+SELECT k FROM tab3
+EXCEPT ALL
+SELECT k, v FROM tab4;
+
+-- Chain of set operations
+SELECT * FROM tab3
+EXCEPT ALL
+SELECT * FROM tab4
+UNION
+SELECT * FROM tab3
+EXCEPT DISTINCT
+SELECT * FROM tab4;
+
+-- Using MINUS ALL
+SELECT * FROM tab3
+MINUS ALL
+SELECT * FROM tab4
+UNION
+SELECT * FROM tab3
+MINUS DISTINCT
+SELECT * FROM tab4;
+
+-- Chain of set operations
+SELECT * FROM tab3
+EXCEPT ALL
+SELECT * FROM tab4
+EXCEPT DISTINCT
+SELECT * FROM tab3
+EXCEPT DISTINCT
+SELECT * FROM tab4;
+
+-- Join under except all. Should produce empty resultset since both left and 
right sets 
+-- are same.
+SELECT * 
+FROM   (SELECT udf(tab3.k),
+               udf(tab4.v)
+        FROM   tab3 
+               JOIN tab4 
+                 ON tab3.k = tab4.k)
 
 Review comment:
   Hm, sounds possibly this is a bug. Let's create a JIRA.
   BTW, if possible, it would be better to create a JIRA with a minimised and 
narrowed-down reproducer in its JIRA description - actually this is one of the 
key points of why we're doing this :D.
   
   For instance, 
   
   ```python
   from pyspark.sql.functions import pandas_udf, PandasUDFType
   
   
   @pandas_udf("string", PandasUDFType.SCALAR)
   def noop(x):
       return x + 1
   
   spark.udf.register("udf", noop)
   
   spark.sql("CREATE TEMPORARY VIEW ...")
   spark.sql("...udf(...)...").show()
   ```
   
   _If possible_, It might be even better if we can reproduce it via Python 
native APIs as well since it'd be very likely reproducible with Python API 
itself. For instance,
   
   ```python
   from pyspark.sql.functions import pandas_udf, PandasUDFType
   
   
   @pandas_udf("string", PandasUDFType.SCALAR)
   def noop(x):
       return x + 1
   
   
   df1 = ...
   df2 = ...
   df1.join(df2 ...).show()
   ```
   
   It might be even better to show that it works in Scala API with a minimised 
reproducer in the JIRA description. That will make other contributors and 
committers can easily focus on bug-fixing itself alone.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] HyukjinKwon commented on a change in pull request #25090: [SPARK-28278][SQL][PYTHON][TESTS] Convert and port 'except-all.sql' into UDF test base

Reply via email to