chenhao-db opened a new pull request, #46809: URL: https://github.com/apache/spark/pull/46809
### What changes were proposed in this pull request?

This is a performance optimization for PySpark that mitigates a performance regression across versions. For context, `sc._jvm` is a Py4J `JVMView` object with an overloaded `__getattr__` implementation ([source](https://github.com/py4j/py4j/blob/master/py4j-python/src/py4j/java_gateway.py#L1741)). Accessing `.functions` internally sends a command to the Py4J server, which searches through a list of imports for a package containing a `functions` class ([source](https://github.com/py4j/py4j/blob/master/py4j-java/src/main/java/py4j/reflection/TypeUtil.java#L249)) and eventually finds `org.apache.spark.sql.functions`. The failed reflection attempts are much more expensive than the final successful one. By using the fully qualified class name directly, we avoid all of the failed reflection attempts.

### Why are the changes needed?

It improves PySpark performance when building large `DataFrame`s.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests. The following code can verify the performance improvement:

```python
import pyspark.sql.functions as F
from datetime import datetime

for i in range(5):
    T = datetime.now()
    df = spark.range(0, 10).agg(F.array([F.sum(F.col("id")) for i in range(0, 500)]))
    print(datetime.now() - T)
```

In a local PySpark shell, each iteration takes about 1 s before the optimization and about 0.5 s after.

### Was this patch authored or co-authored using generative AI tooling?

No.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
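The cost pattern described above can be sketched with a toy model (plain Python, not Py4J itself; the registry and import list below are hypothetical stand-ins for the JVM-side lookup): resolving a short name means probing every registered import prefix, most of which fail, while a fully qualified name resolves in a single lookup.

```python
# Hypothetical stand-in for the JVM-side class table.
REGISTRY = {
    "java.lang.System": "System-class",
    "org.apache.spark.sql.functions": "functions-class",
}

# Hypothetical stand-in for the JVMView's list of imports.
IMPORTS = ["java.lang", "java.util", "scala", "org.apache.spark.sql"]

def resolve_short_name(name):
    # Mirrors the import search: try each prefix in order; every miss
    # corresponds to a failed (and expensive) reflection attempt.
    for prefix in IMPORTS:
        fqn = f"{prefix}.{name}"
        if fqn in REGISTRY:
            return REGISTRY[fqn]
    raise AttributeError(name)

def resolve_fqn(fqn):
    # Fully qualified name: one lookup, no failed attempts.
    return REGISTRY[fqn]

print(resolve_short_name("functions"))                # probes 4 prefixes
print(resolve_fqn("org.apache.spark.sql.functions"))  # single lookup
```

Both calls return the same class; the difference is only in how many failed probes happen on the way, which is what this PR eliminates on the Py4J side.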
