This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch branch-4.0
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-4.0 by this push:
new 1350986b53ac [SPARK-51242][CONNECT][PYTHON] Improve Column performance when DQC is disabled
1350986b53ac is described below
commit 1350986b53ace04f2d32724f4b5abf0edb4e231f
Author: Haejoon Lee <[email protected]>
AuthorDate: Wed Feb 19 08:40:08 2025 +0900
[SPARK-51242][CONNECT][PYTHON] Improve Column performance when DQC is disabled
### What changes were proposed in this pull request?
This PR proposes to improve Column performance when DQC (DataFrameQueryContext) is disabled, by deferring the call to `getActiveSession`, which is fairly expensive.
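As a simplified illustration (not the actual pyspark source), the optimization amounts to moving an expensive lookup behind a cheap flag check, so the hot path never pays for it. `expensive_lookup` and `with_origin` below are hypothetical stand-ins:

```python
import functools

lookup_calls = 0  # counts how often the expensive call actually runs


def expensive_lookup():
    """Stand-in for an expensive call such as SparkSession.getActiveSession()."""
    global lookup_calls
    lookup_calls += 1
    return object()


def with_origin(debugging_enabled):
    """Decorator factory sketching the patched control-flow shape."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if debugging_enabled:
                # The lookup now happens only when debugging is on.
                if expensive_lookup() is None:
                    return func(*args, **kwargs)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@with_origin(debugging_enabled=False)
def alias(name):
    return name


for _ in range(10000):
    alias("a")
# With debugging disabled, expensive_lookup never runs on this hot path.
```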
### Why are the changes needed?
To improve the performance of Column operations.
### Does this PR introduce _any_ user-facing change?
No. There are no API changes; it only improves performance.
### How was this patch tested?
Manually tested, and also the existing CI should pass.
```python
>>> spark.conf.get("spark.python.sql.dataFrameDebugging.enabled")
'false'
```
**Before fix**
```python
>>> import time
>>> import pyspark.sql.functions as F
>>>
>>> c = F.col("name")
>>> start = time.time()
>>> for i in range(10000):
... _ = c.alias("a")
...
>>> print(time.time() - start)
2.061354875564575
```
**After fix**
```python
>>> import time
>>> import pyspark.sql.functions as F
>>>
>>> c = F.col("name")
>>> start = time.time()
>>> for i in range(10000):
... _ = c.alias("a")
...
>>> print(time.time() - start)
0.8050589561462402
```
There is no meaningful difference when the flag is on:
```python
>>> spark.conf.get("spark.python.sql.dataFrameDebugging.enabled")
'true'
```
**Before fix**
```python
>>> import time
>>> import pyspark.sql.functions as F
>>>
>>> c = F.col("name")
>>> start = time.time()
>>> for i in range(10000):
... _ = c.alias("a")
...
>>> print(time.time() - start)
3.755108118057251
```
**After fix**
```python
>>> import time
>>> import pyspark.sql.functions as F
>>>
>>> c = F.col("name")
>>> start = time.time()
>>> for i in range(10000):
... _ = c.alias("a")
...
>>> print(time.time() - start)
3.6577670574188232
```
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #49982 from itholic/DQC_improvement.
Authored-by: Haejoon Lee <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 489ba0d24975096c2953a6afbea7adb65b3ed4db)
Signed-off-by: Hyukjin Kwon <[email protected]>
---
python/pyspark/errors/utils.py | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/python/pyspark/errors/utils.py b/python/pyspark/errors/utils.py
index 0d01cbb961bb..053b2c91370e 100644
--- a/python/pyspark/errors/utils.py
+++ b/python/pyspark/errors/utils.py
@@ -255,9 +255,7 @@ def _with_origin(func: FuncT) -> FuncT:
from pyspark.sql import SparkSession
from pyspark.sql.utils import is_remote
- spark = SparkSession.getActiveSession()
-
- if spark is not None and hasattr(func, "__name__") and is_debugging_enabled():
+ if hasattr(func, "__name__") and is_debugging_enabled():
if is_remote():
# Getting the configuration requires RPC call. Uses the default value for now.
depth = 1
@@ -268,6 +266,9 @@ def _with_origin(func: FuncT) -> FuncT:
finally:
set_current_origin(None, None)
else:
+ spark = SparkSession.getActiveSession()
+ if spark is None:
+ return func(*args, **kwargs)
assert spark._jvm is not None
jvm_pyspark_origin = getattr(
spark._jvm, "org.apache.spark.sql.catalyst.trees.PySparkCurrentOrigin"
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]