[GitHub] [spark] sunchao commented on a change in pull request #32082: [SPARK-34981][SQL] Implement V2 function resolution and evaluation

GitBox Tue, 27 Apr 2021 12:37:59 -0700


sunchao commented on a change in pull request #32082:
URL: https://github.com/apache/spark/pull/32082#discussion_r621542001




##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/ScalarFunction.java
##########
@@ -23,17 +23,67 @@
 /**
  * Interface for a function that produces a result value for each input row.
  * <p>
- * For each input row, Spark will call a produceResult method that corresponds to the
- * {@link #inputTypes() input data types}. The expected JVM argument types must be the types used by
- * Spark's InternalRow API. If no direct method is found or when not using codegen, Spark will call
- * {@link #produceResult(InternalRow)}.
+ * To evaluate each input row, Spark will first try to lookup and use a "magic method" (described
+ * below) through Java reflection. If the method is not found, Spark will call
+ * {@link #produceResult(InternalRow)} as a fallback approach.
  * <p>
  * The JVM type of result values produced by this function must be the type used by Spark's
  * InternalRow API for the {@link DataType SQL data type} returned by {@link #resultType()}.
+ * <p>
+ * <b>IMPORTANT</b>: the default implementation of {@link #produceResult} throws
+ * {@link UnsupportedOperationException}. Users can choose to override this method, or implement
+ * a "magic method" with name {@link #MAGIC_METHOD_NAME} which takes individual parameters
+ * instead of a {@link InternalRow}. The magic method will be loaded by Spark through Java
+ * reflection and will also provide better performance in general, due to optimizations such as
+ * codegen, removal of Java boxing, etc.
+ *
+ * For example, a scalar UDF for adding two integers can be defined as follow with the magic
+ * method approach:
+ *
+ * <pre>
+ *   public class IntegerAdd implements{@code ScalarFunction<Integer>} {
+ *     public int invoke(int left, int right) {

Review comment:
       @cloud-fan I think we can also consider adding another "static invoke" API for stateless UDFs. From the benchmark you ran some time back, it seems this can give a decent performance improvement. WDYT?
   ```
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_161-b12 on Mac OS X 10.14.6
   Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
   UDF perf:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   native add                                        14206          14516         535         70.4          14.2       1.0X
   udf add                                           24609          25271         898         40.6          24.6       0.6X
   new udf add                                       18657          19096         726         53.6          18.7       0.8X
   new row udf add                                   21128          22343        1478         47.3          21.1       0.7X
   static udf add                                    16678          16887         278         60.0          16.7       0.9X
   ```
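
   To make the idea concrete, here is a rough, self-contained sketch of what a "static invoke" lookup could look like next to the instance one. It does not use any Spark classes: `IntegerAdd`, `StaticIntegerAdd`, and `callInvoke` are hypothetical names for illustration only, and the reflective dispatch is just one possible way to tell the two flavors apart, not what the PR implements.

   ```java
   import java.lang.reflect.Method;
   import java.lang.reflect.Modifier;

   public class StaticInvokeSketch {

     // Hypothetical UDF with an instance "invoke": the caller needs a receiver object.
     public static class IntegerAdd {
       public int invoke(int left, int right) {
         return left + right;
       }
     }

     // Hypothetical stateless UDF: "invoke" is static, so no receiver is needed.
     public static class StaticIntegerAdd {
       public static int invoke(int left, int right) {
         return left + right;
       }
     }

     // Looks up invoke(int, int) by reflection and calls it, creating a
     // receiver only when the method turns out to be non-static.
     public static int callInvoke(Class<?> udfClass, int left, int right) {
       try {
         Method m = udfClass.getMethod("invoke", int.class, int.class);
         Object receiver = Modifier.isStatic(m.getModifiers())
             ? null
             : udfClass.getDeclaredConstructor().newInstance();
         return (Integer) m.invoke(receiver, left, right);
       } catch (ReflectiveOperationException e) {
         throw new IllegalStateException("no usable invoke(int, int) on " + udfClass, e);
       }
     }

     public static void main(String[] args) {
       System.out.println(callInvoke(IntegerAdd.class, 1, 2));
       System.out.println(callInvoke(StaticIntegerAdd.class, 1, 2));
     }
   }
   ```

   Both calls return the same sum; the only difference is that the static variant never needs an instance of the UDF class.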



