This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch branch-4.0
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-4.0 by this push:
new 1350986b53ac [SPARK-51242][CONNECT][PYTHON] Improve Column performance when DQC is disabled
1350986b53ac is described below
commit 1350986b53ace04f2d32724f4b5abf0edb4e231f
Author: Haejoon Lee <[email protected]>
AuthorDate: Wed Feb 19 08:40:08 2025 +0900
[SPARK-51242][CONNECT][PYTHON] Improve Column performance when DQC is disabled
### What changes were proposed in this pull request?
This PR proposes to improve Column performance when DQC (DataFrameQueryContext) is disabled, by deferring the call to `getActiveSession`, which is fairly expensive.
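As a simplified illustration (not the actual pyspark source), the optimization amounts to moving an expensive lookup behind a cheap flag check, so the hot path never pays for it. `expensive_lookup` and `with_origin` below are hypothetical stand-ins:

```python
import functools

lookup_calls = 0  # counts how often the expensive call actually runs


def expensive_lookup():
    """Stand-in for an expensive call such as SparkSession.getActiveSession()."""
    global lookup_calls
    lookup_calls += 1
    return object()


def with_origin(debugging_enabled):
    """Decorator factory sketching the patched control-flow shape."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if debugging_enabled:
                # The lookup now happens only when debugging is on.
                if expensive_lookup() is None:
                    return func(*args, **kwargs)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@with_origin(debugging_enabled=False)
def alias(name):
    return name


for _ in range(10000):
    alias("a")
# With debugging disabled, expensive_lookup never runs on this hot path.
```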
### Why are the changes needed?
To improve the performance of Column operations.
### Does this PR introduce _any_ user-facing change?
No. There are no API changes; it only improves performance.
### How was this patch tested?
Manually tested, and also the existing CI should pass.
```python
>>> spark.conf.get("spark.python.sql.dataFrameDebugging.enabled")
'false'
```
**Before fix**
```python
>>> import time
>>> import pyspark.sql.functions as F
>>>
>>> c = F.col("name")
>>> start = time.time()
>>> for i in range(10000):
... _ = c.alias("a")
...
>>> print(time.time() - start)
2.061354875564575
```
**After fix**
```python
>>> import time
>>> import pyspark.sql.functions as F
>>>
>>> c = F.col("name")
>>> start = time.time()
>>> for i in range(10000):
... _ = c.alias("a")
...
>>> print(time.time() - start)
0.8050589561462402
```
There is no meaningful difference when the flag is on:
```python
>>> spark.conf.get("spark.python.sql.dataFrameDebugging.enabled")
'true'
```
**Before fix**
```python
>>> import time
>>> import pyspark.sql.functions as F
>>>
>>> c = F.col("name")
>>> start = time.time()
>>> for i in range(10000):
... _ = c.alias("a")
...
>>> print(time.time() - start)
3.755108118057251
```
**After fix**
```python
>>> import time
>>> import pyspark.sql.functions as F
>>>
>>> c = F.col("name")
>>> start = time.time()
>>> for i in range(10000):
... _ = c.alias("a")
...
>>> print(time.time() - start)
3.6577670574188232
```
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #49982 from itholic/DQC_improvement.
Authored-by: Haejoon Lee <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 489ba0d24975096c2953a6afbea7adb65b3ed4db)
Signed-off-by: Hyukjin Kwon <[email protected]>
---
python/pyspark/errors/utils.py | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/python/pyspark/errors/utils.py b/python/pyspark/errors/utils.py
index 0d01cbb961bb..053b2c91370e 100644
--- a/python/pyspark/errors/utils.py
+++ b/python/pyspark/errors/utils.py
@@ -255,9 +255,7 @@ def _with_origin(func: FuncT) -> FuncT:
from pyspark.sql import SparkSession
from pyspark.sql.utils import is_remote
- spark = SparkSession.getActiveSession()
-
- if spark is not None and hasattr(func, "__name__") and is_debugging_enabled():
+ if hasattr(func, "__name__") and is_debugging_enabled():
if is_remote():
# Getting the configuration requires RPC call. Uses the default value for now.
depth = 1
@@ -268,6 +266,9 @@ def _with_origin(func: FuncT) -> FuncT:
finally:
set_current_origin(None, None)
else:
+ spark = SparkSession.getActiveSession()
+ if spark is None:
+ return func(*args, **kwargs)
assert spark._jvm is not None
jvm_pyspark_origin = getattr(
spark._jvm, "org.apache.spark.sql.catalyst.trees.PySparkCurrentOrigin"
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]