chenhao-db opened a new pull request, #46809:
URL: https://github.com/apache/spark/pull/46809

   ### What changes were proposed in this pull request?
   
   This is a performance optimization for PySpark that mitigates a performance regression across versions. For context, `sc._jvm` is a `JVMView` object in Py4J with an overloaded `__getattr__` implementation ([source](https://github.com/py4j/py4j/blob/master/py4j-python/src/py4j/java_gateway.py#L1741)). Accessing `.functions` on it sends a command to the Py4J server, which searches its list of registered imports for a package containing a `functions` class ([source](https://github.com/py4j/py4j/blob/master/py4j-java/src/main/java/py4j/reflection/TypeUtil.java#L249)) and eventually finds `org.apache.spark.sql.functions`. The failed reflection attempts along the way are much more expensive than the final successful one. Instead, we can use the fully qualified class name directly and avoid all of the failed reflection attempts.
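   The lookup behavior can be illustrated with a minimal pure-Python sketch. This is not Py4J's actual code; the import prefixes and class registry below are illustrative stand-ins for the server-side search that `TypeUtil` performs:
   
   ```python
   # Sketch (not Py4J itself): resolving a short name tries each registered
   # import prefix in turn; only the last candidate here resolves, so three
   # lookups fail before the hit. A fully qualified name skips the search.
   registered_imports = [
       "java.lang",
       "java.util",
       "org.apache.spark",
       "org.apache.spark.sql",
   ]
   known_classes = {"org.apache.spark.sql.functions"}
   attempts = []
   
   def resolve(name):
       """Try `name` under every import prefix until one is a known class."""
       for prefix in registered_imports:
           candidate = f"{prefix}.{name}"
           attempts.append(candidate)
           if candidate in known_classes:
               return candidate
       raise AttributeError(name)
   
   resolved = resolve("functions")
   assert resolved == "org.apache.spark.sql.functions"
   print(len(attempts))  # 4 attempts for the short name; 1 would suffice
   ```
   
   In the real system each failed candidate costs a reflective class lookup on the JVM, which is why skipping straight to `org.apache.spark.sql.functions` helps.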
   
   ### Why are the changes needed?
   
   It improves PySpark performance when building large `DataFrame`s.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   
   ### How was this patch tested?
   
   Existing tests.
   
   The following code can verify the performance improvement:
   
   ```python
   import pyspark.sql.functions as F
   from datetime import datetime
   
   # Build a wide aggregation; each column expression triggers Py4J calls.
   for i in range(5):
       T = datetime.now()
       df = spark.range(0, 10).agg(F.array([F.sum(F.col("id")) for _ in range(500)]))
       print(datetime.now() - T)
   ```
   
   In a local PySpark shell, the per-iteration time drops from about 1s before the optimization to about 0.5s after.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
