huaxingao commented on a change in pull request #25101: 
[SPARK-28277][SQL][PYTHON][TESTS] Convert and port 'except.sql' into UDF test 
base
URL: https://github.com/apache/spark/pull/25101#discussion_r303188362
 
 

 ##########
 File path: sql/core/src/test/resources/sql-tests/inputs/udf/udf-except.sql
 ##########
 @@ -0,0 +1,58 @@
+-- This test file was converted from except.sql.
+-- Tests different scenarios of except operation
+create temporary view t1 as select * from values
+  ("one", 1),
+  ("two", 2),
+  ("three", 3),
+  ("one", NULL)
+  as t1(k, v);
+
+create temporary view t2 as select * from values
+  ("one", 1),
+  ("two", 22),
+  ("one", 5),
+  ("one", NULL),
+  (NULL, 5)
+  as t2(k, v);
+
+
+-- Except operation that will be replaced by left anti join
+SELECT * FROM t1 EXCEPT SELECT * FROM t2;
+
+
+-- Except operation that will be replaced by Filter: SPARK-22181
+SELECT * FROM t1 EXCEPT SELECT * FROM t1 where udf(v) <> 1 and v <> udf(2);
+
+
+-- Except operation that will be replaced by Filter: SPARK-22181
+SELECT * FROM t1 where udf(v) <> 1 and v <> udf(22) EXCEPT SELECT * FROM t1 
where udf(v) <> 2 and v >= udf(3);
+
+
+-- Except operation that will be replaced by Filter: SPARK-22181
+SELECT t1.* FROM t1, t2 where t1.k = t2.k
+EXCEPT
+SELECT t1.* FROM t1, t2 where t1.k = t2.k and t1.k != udf('one');
+
+
+-- Except operation that will be replaced by left anti join
+SELECT * FROM t2 where v >= udf(1) and udf(v) <> 22 EXCEPT SELECT * FROM t1;
+
+
+-- Except operation that will be replaced by left anti join
+SELECT (SELECT min(udf(k)) FROM t2 WHERE t2.k = t1.k) min_t2 FROM t1
+MINUS
+SELECT (SELECT udf(min(k)) FROM t2) abs_min_t2 FROM t1 WHERE  t1.k = 
udf('one');
+
+
+-- Except operation that will be replaced by left anti join
+SELECT t1.k
+FROM   t1
+WHERE  t1.v <= (SELECT   max(udf(t2.v))
 
 Review comment:
   I did some more tests on this ```max```. Here are the results:
   ```
   -- !query 10
   SELECT max(t2.v) FROM t2
   -- !query 10 schema
   struct<max(v):int>
   -- !query 10 output
   22
   
   
   -- !query 11
   SELECT udf(t2.v) FROM t2
   -- !query 11 schema
   struct<udf(v):string>
   -- !query 11 output
   1
   22
   5.0
   5.0
   nan
   
   
   -- !query 12
   SELECT max(udf(t2.v)) FROM t2
   -- !query 12 schema
   struct<max(udf(v)):string>
   -- !query 12 output
   nan
   
   
   -- !query 13
   SELECT udf(max(t2.v)) FROM t2
   -- !query 13 schema
   struct<udf(max(v)):string>
   -- !query 13 output
   22
   
   
   -- !query 14
   SELECT udf(max(udf(t2.v))) FROM t2
   -- !query 14 schema
   struct<udf(max(udf(v))):string>
   -- !query 14 output
   nan
   
   
   -- !query 15
   SELECT *
   FROM   t1
   WHERE  t1.v <= (SELECT   max(t2.v)
                   FROM     t2
                   WHERE    t2.k = t1.k)
   -- !query 15 schema
   struct<k:string,v:int>
   -- !query 15 output
   one  1
   two  2
   
   
   -- !query 16
   SELECT *
   FROM   t1
   WHERE  t1.v <= (SELECT   udf(max(t2.v))
                   FROM     t2
                   WHERE    t2.k = t1.k)
   -- !query 16 schema
   struct<>
   -- !query 16 output
   java.lang.UnsupportedOperationException
   Cannot evaluate expression: udf(null)
   
   
   -- !query 17
   SELECT *
   FROM   t1
   WHERE  t1.v <= (SELECT   max(udf(t2.v))
                   FROM     t2
                   WHERE    t2.k = t1.k)
   -- !query 17 schema
   struct<k:string,v:int>
   -- !query 17 output
   two  2
   
   
   -- !query 18
   SELECT *
   FROM   t1
   WHERE  t1.v <= (SELECT   udf(max(udf(t2.v)))
                   FROM     t2
                   WHERE    t2.k = t1.k)
   -- !query 18 schema
   struct<>
   -- !query 18 output
   java.lang.UnsupportedOperationException
   Cannot evaluate expression: udf(null)
   ```
   I initially thought ```max(t2.v) returns 22``` is the right behavior, but 
after I looked the implementation of ```functions.max(e: Column)```, I am not 
sure any more. This ```functions.max(e: Column)``` eventually calls this 
   ```nanSafeCompareDoubles```, and it treats NaN greater than any non-NaN 
double. So  ```functions.max(t2.v)``` returns ```NaN```. Is this right?
   Should we let ```max(t2.v) returns 22``` instead of ```NaN```?
   ```
     /**
      * NaN-safe version of `java.lang.Double.compare()` which allows NaN 
values to be compared
      * according to semantics where NaN == NaN and NaN is greater than any 
non-NaN double.
      */
   def nanSafeCompareDoubles(x: Double, y: Double): Int = {
       val xIsNan: Boolean = java.lang.Double.isNaN(x)
       val yIsNan: Boolean = java.lang.Double.isNaN(y)
       if ((xIsNan && yIsNan) || (x == y)) 0
       else if (xIsNan) 1
       else if (yIsNan) -1
       else if (x > y) 1
       else -1
     }
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to