This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 9297c5d6e90a [SPARK-50684][PYTHON] Improve Py4J performance in DataFrameQueryContext
9297c5d6e90a is described below
commit 9297c5d6e90a089fc71b7edea64938ffb0890242
Author: Hyukjin Kwon <[email protected]>
AuthorDate: Fri Dec 27 17:22:46 2024 +0900
[SPARK-50684][PYTHON] Improve Py4J performance in DataFrameQueryContext
### What changes were proposed in this pull request?
This PR proposes to improve Py4J performance in DataFrameQueryContext by
reducing the number of Py4J calls. The same logic in
https://github.com/apache/spark/pull/46809 applies here.
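The gain comes from collapsing a chain of Py4J attribute lookups (one per dotted segment) into a single `getattr` call with the fully qualified class name, which py4j's `JVMView` resolves in one round-trip. A minimal sketch of the difference, using a hypothetical `FakeJVMView` counter rather than py4j itself:

```python
class FakeJVMView:
    """Stand-in for py4j's JVMView: every __getattr__ counts as one round-trip."""

    def __init__(self, counter):
        self._counter = counter

    def __getattr__(self, name):
        # py4j resolves each lookup (a single segment, or a full "a.b.c"
        # name passed to getattr) with a round-trip; here we just count.
        self._counter[0] += 1
        return FakeJVMView(self._counter)


calls = [0]
jvm = FakeJVMView(calls)

# Chained access: one simulated round-trip per dotted segment (7 here).
jvm.org.apache.spark.sql.catalyst.trees.PySparkCurrentOrigin
chained = calls[0]

calls[0] = 0
# Single getattr with the fully qualified name: one simulated round-trip.
getattr(jvm, "org.apache.spark.sql.catalyst.trees.PySparkCurrentOrigin")
single = calls[0]

print(chained, single)
```

In the real code path each simulated lookup is a socket round-trip to the JVM, which is where the `recv_into`/`sendall` time in the profiles below comes from.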
### Why are the changes needed?
To reduce the overhead of Py4J calls and speed up `DataFrameQueryContext`.
### Does this PR introduce _any_ user-facing change?
Yes, it improves the performance of `DataFrameQueryContext`.
### How was this patch tested?
Manually via:
```python
import cProfile
from pyspark.sql.functions import col

def foo():
    for _ in range(1000):
        col("id")

cProfile.run('foo()', sort='tottime')
```
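The same `cProfile`/`pstats` pattern works on any callable without a Spark session; a minimal Spark-free sketch (the `work` function is illustrative, not from the patch):

```python
import cProfile
import io
import pstats

def work():
    # Stand-in workload; any callable can be profiled the same way.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Render the top entries sorted by total time, as in the runs below.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("tottime").print_stats(5)
report = buf.getvalue()
print("work" in report)  # the profiled function appears in the report
```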
**Before:**
```
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    27014    2.008    0.000    2.008    0.000 {method 'recv_into' of '_socket.socket' objects}
1009/1000    0.421    0.000    0.861    0.001 inspect.py:969(getmodule)
   969976    0.138    0.000    0.173    0.000 inspect.py:283(ismodule)
    27014    0.128    0.000    0.128    0.000 {method 'sendall' of '_socket.socket' objects}
968691/968655    0.108    0.000    0.110    0.000 {built-in method builtins.hasattr}
    27014    0.078    0.000    2.340    0.000 clientserver.py:523(send_command)
```
**After:**
```
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    21014    1.423    0.000    1.423    0.000 {method 'recv_into' of '_socket.socket' objects}
1009/1000    0.423    0.000    0.851    0.001 inspect.py:969(getmodule)
   969976    0.137    0.000    0.171    0.000 inspect.py:283(ismodule)
    21014    0.117    0.000    0.117    0.000 {method 'sendall' of '_socket.socket' objects}
968691/968655    0.104    0.000    0.106    0.000 {built-in method builtins.hasattr}
     8002    0.066    0.000    0.066    0.000 {built-in method builtins.next}
```
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #49312 from HyukjinKwon/improve-performance-origin.
Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
---
python/pyspark/errors/utils.py | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/python/pyspark/errors/utils.py b/python/pyspark/errors/utils.py
index f9f60637bd57..d928afc813a4 100644
--- a/python/pyspark/errors/utils.py
+++ b/python/pyspark/errors/utils.py
@@ -270,8 +270,8 @@ def _with_origin(func: FuncT) -> FuncT:
                 set_current_origin(None, None)
             else:
                 assert spark._jvm is not None
-                jvm_pyspark_origin = (
-                    spark._jvm.org.apache.spark.sql.catalyst.trees.PySparkCurrentOrigin
+                jvm_pyspark_origin = getattr(
+                    spark._jvm, "org.apache.spark.sql.catalyst.trees.PySparkCurrentOrigin"
                 )
                 depth = int(
                     spark.conf.get(  # type: ignore[arg-type]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]