LucaCanali opened a new pull request #33559: URL: https://github.com/apache/spark/pull/33559
### What changes are proposed in this pull request?

This proposes to add SQLMetrics instrumentation for Python UDF execution. The proposed metrics are:
- data sent to Python workers
- data returned from Python workers
- number of rows processed

### Why are the changes needed?

This aims at improving monitoring and performance troubleshooting of Python UDFs. In particular, the metrics help answer performance-related questions such as: why is the UDF slow? How much work has it done so far?

### Does this PR introduce _any_ user-facing change?

SQL metrics are made available in the Web UI. See the following examples:

### How was this patch tested?

Manually tested; a Python unit test has also been added. Example code used for testing:

```
from pyspark.sql.functions import col, pandas_udf
import time

# Pandas UDF with an artificial delay per batch, so the metrics are easy to observe
@pandas_udf("long")
def test_pandas(col1):
    time.sleep(0.02)
    return col1 * col1

spark.udf.register("test_pandas", test_pandas)
spark.sql("select rand(42)*rand(51)*rand(12) col1 from range(10000000)").createOrReplaceTempView("t1")
spark.sql("select max(test_pandas(col1)) from t1").collect()
```

The following variant is used to test with more data pushed to the Python workers:

```
from pyspark.sql.functions import col, pandas_udf
import time

# Same UDF, but taking 17 input columns to increase the data sent to Python
@pandas_udf("long")
def test_pandas(col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17):
    time.sleep(0.02)
    return col1

spark.udf.register("test_pandas", test_pandas)
spark.sql("select rand(42)*rand(51)*rand(12) col1 from range(10000000)").createOrReplaceTempView("t1")
spark.sql("select max(test_pandas(col1,col1+1,col1+2,col1+3,col1+4,col1+5,col1+6,col1+7,col1+8,col1+9,col1+10,col1+11,col1+12,col1+13,col1+14,col1+15,col1+16)) from t1").collect()
```

This is for testing a Python UDF (non-pandas):
`from pyspark.sql.functions import udf; spark.range(100).select(udf(lambda x: x/1)("id")).collect()`

--
This is an automated message from the Apache Git Service.
