This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 9297c5d6e90a [SPARK-50684][PYTHON] Improve Py4J performance in DataFrameQueryContext
9297c5d6e90a is described below
commit 9297c5d6e90a089fc71b7edea64938ffb0890242
Author: Hyukjin Kwon <[email protected]>
AuthorDate: Fri Dec 27 17:22:46 2024 +0900
[SPARK-50684][PYTHON] Improve Py4J performance in DataFrameQueryContext
### What changes were proposed in this pull request?
This PR proposes to improve Py4J performance in DataFrameQueryContext by
reducing the number of Py4J calls. The same logic in
https://github.com/apache/spark/pull/46809 applies here.
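The gain comes from collapsing a chain of Py4J attribute lookups (one per dotted segment) into a single `getattr` call with the fully qualified class name, which py4j's `JVMView` resolves in one round-trip. A minimal sketch of the difference, using a hypothetical `FakeJVMView` counter rather than py4j itself:

```python
class FakeJVMView:
    """Stand-in for py4j's JVMView: every __getattr__ counts as one round-trip."""

    def __init__(self, counter):
        self._counter = counter

    def __getattr__(self, name):
        # py4j resolves each lookup (a single segment, or a full "a.b.c"
        # name passed to getattr) with a round-trip; here we just count.
        self._counter[0] += 1
        return FakeJVMView(self._counter)


calls = [0]
jvm = FakeJVMView(calls)

# Chained access: one simulated round-trip per dotted segment (7 here).
jvm.org.apache.spark.sql.catalyst.trees.PySparkCurrentOrigin
chained = calls[0]

calls[0] = 0
# Single getattr with the fully qualified name: one simulated round-trip.
getattr(jvm, "org.apache.spark.sql.catalyst.trees.PySparkCurrentOrigin")
single = calls[0]

print(chained, single)
```

In the real code path each simulated lookup is a socket round-trip to the JVM, which is where the `recv_into`/`sendall` time in the profiles below comes from.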
### Why are the changes needed?
To reduce the overhead of Py4J calls and speed up `DataFrameQueryContext`.
### Does this PR introduce _any_ user-facing change?
Yes, it improves the performance of `DataFrameQueryContext`.
### How was this patch tested?
Manually via:
```python
import cProfile
from pyspark.sql.functions import col

def foo():
    for _ in range(1000):
        col("id")

cProfile.run('foo()', sort='tottime')
```
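The same `cProfile`/`pstats` pattern works on any callable without a Spark session; a minimal Spark-free sketch (the `work` function is illustrative, not from the patch):

```python
import cProfile
import io
import pstats

def work():
    # Stand-in workload; any callable can be profiled the same way.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Render the top entries sorted by total time, as in the runs below.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("tottime").print_stats(5)
report = buf.getvalue()
print("work" in report)  # the profiled function appears in the report
```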
**Before:**
```
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    27014    2.008    0.000    2.008    0.000 {method 'recv_into' of '_socket.socket' objects}
1009/1000    0.421    0.000    0.861    0.001 inspect.py:969(getmodule)
   969976    0.138    0.000    0.173    0.000 inspect.py:283(ismodule)
    27014    0.128    0.000    0.128    0.000 {method 'sendall' of '_socket.socket' objects}
968691/968655    0.108    0.000    0.110    0.000 {built-in method builtins.hasattr}
    27014    0.078    0.000    2.340    0.000 clientserver.py:523(send_command)
```
**After:**
```
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    21014    1.423    0.000    1.423    0.000 {method 'recv_into' of '_socket.socket' objects}
1009/1000    0.423    0.000    0.851    0.001 inspect.py:969(getmodule)
   969976    0.137    0.000    0.171    0.000 inspect.py:283(ismodule)
    21014    0.117    0.000    0.117    0.000 {method 'sendall' of '_socket.socket' objects}
968691/968655    0.104    0.000    0.106    0.000 {built-in method builtins.hasattr}
     8002    0.066    0.000    0.066    0.000 {built-in method builtins.next}
```
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #49312 from HyukjinKwon/improve-performance-origin.
Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
---
python/pyspark/errors/utils.py | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/python/pyspark/errors/utils.py b/python/pyspark/errors/utils.py
index f9f60637bd57..d928afc813a4 100644
--- a/python/pyspark/errors/utils.py
+++ b/python/pyspark/errors/utils.py
@@ -270,8 +270,8 @@ def _with_origin(func: FuncT) -> FuncT:
                 set_current_origin(None, None)
             else:
                 assert spark._jvm is not None
-                jvm_pyspark_origin = (
-                    spark._jvm.org.apache.spark.sql.catalyst.trees.PySparkCurrentOrigin
+                jvm_pyspark_origin = getattr(
+                    spark._jvm, "org.apache.spark.sql.catalyst.trees.PySparkCurrentOrigin"
                 )
                 depth = int(
                     spark.conf.get(  # type: ignore[arg-type]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]