HyukjinKwon opened a new pull request, #49000:
URL: https://github.com/apache/spark/pull/49000

   ### What changes were proposed in this pull request?
   
   This PR proposes to add a configuration that enables asynchronous execution 
for Python UDFs with Arrow optimization. One use case is, for example, making 
RESTful API requests within a Python UDF, which is slowed down by I/O. With 
this configuration, those I/O requests can happen in parallel.
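   The idea can be illustrated outside of Spark with a small, self-contained sketch (this is not Spark's actual implementation; the function name `slow_io` and the simulated latency are assumptions for illustration): applying an I/O-bound function over a batch of values with a pool of workers lets the waits overlap, which is the same effect the new configuration targets.

   ```python
   # Illustrative sketch only, not Spark's implementation: show how a
   # concurrency level hides I/O latency when applying a Python function
   # to a batch of values.
   from concurrent.futures import ThreadPoolExecutor
   import time

   def slow_io(x):
       # Stand-in for an I/O-bound call such as a RESTful API request.
       time.sleep(0.1)
       return x * 2

   values = list(range(10))

   # Sequential: roughly 10 * 0.1s, since each call waits in turn.
   start = time.time()
   sequential = [slow_io(v) for v in values]
   sequential_secs = time.time() - start

   # Concurrent with 10 workers: roughly 0.1s, since the waits overlap.
   start = time.time()
   with ThreadPoolExecutor(max_workers=10) as pool:
       concurrent = list(pool.map(slow_io, values))
   concurrent_secs = time.time() - start

   print(f"sequential: {sequential_secs:.2f}s, concurrent: {concurrent_secs:.2f}s")
   ```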
   
   ### Why are the changes needed?
   
   To speed up UDF execution when the UDF is I/O-bound. For example, the code below:
   
   ```python
   spark.conf.set("spark.sql.execution.pythonUDF.arrow.concurrency.level", 10)
   spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", "true")
   
   from pyspark.sql.functions import udf, col
   import time
   
   @udf
   def my_rest_func(x):
       import requests
    requests.get("https://httpbin.org/get")
   
   
   start_time = time.time()
   _ = spark.range(100).coalesce(1).select(my_rest_func(col("id"))).collect()
   print(time.time() - start_time)
   ```
   
   can be 10x faster.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, it adds a new configuration called 
`spark.sql.execution.pythonUDF.arrow.concurrency.level`.
   
   ### How was this patch tested?
   
   Tested as shown above, and unit tests were also added.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.

