gaogaotiantian commented on code in PR #53076:
URL: https://github.com/apache/spark/pull/53076#discussion_r2628307934
##########
python/pyspark/sql/connect/client/core.py:
##########
@@ -607,6 +614,55 @@ def fromProto(cls, pb: pb2.ConfigResponse) -> "ConfigResult":
)
+def _is_pyspark_source(filename: str) -> bool:
Review Comment:
I don't think we need this to be a function. It's a single line of code.
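For example, the check could be inlined directly in the loop that uses it (a sketch against the code in this hunk, reusing the existing `PYSPARK_ROOT` constant):
```python
# Inside the frame-collection loop, instead of calling a helper:
if filename.startswith(PYSPARK_ROOT):
    # Do not include PySpark internal frames as they are not user application code
    break
```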
##########
python/pyspark/sql/connect/client/core.py:
##########
@@ -1273,6 +1329,10 @@ def _execute_plan_request_with_metadata(
)
req.operation_id = operation_id
self._update_request_with_user_context_extensions(req)
+
+ call_stack_trace = _build_call_stack_trace()
+ if call_stack_trace:
Review Comment:
The minimum supported version is 3.10 now, so you can do
```python
if call_stack_trace := _build_call_stack_trace():
req.user_context.extensions.append(call_stack_trace)
```
##########
python/pyspark/sql/connect/client/core.py:
##########
@@ -607,6 +614,55 @@ def fromProto(cls, pb: pb2.ConfigResponse) -> "ConfigResult":
)
+def _is_pyspark_source(filename: str) -> bool:
+    """Check if the given filename is from the pyspark package."""
+    return filename.startswith(PYSPARK_ROOT)
+
+
+def _retrieve_stack_frames() -> List[CallSite]:
+    """
+    Return a list of CallSites representing the relevant stack frames in the callstack.
+    """
+    frames = traceback.extract_stack()
+
+    filtered_stack_frames = []
+    for i, frame in enumerate(frames):
+        filename, lineno, func, _ = frame
+        if _is_pyspark_source(filename):
+            # Do not include PySpark internal frames as they are not user application code
+            break
+        if i + 1 < len(frames):
+            _, _, func, _ = frames[i + 1]
+        filtered_stack_frames.append(CallSite(function=func, file=filename, linenum=lineno))
+
+    return filtered_stack_frames
+
+
+def _build_call_stack_trace() -> Optional[any_pb2.Any]:
Review Comment:
These functions are exclusively used by `SparkConnectClient`, and they provide information about `SparkConnectClient`. We should put them in the class instead of defining individual functions at the module level (I also believe this is a good pattern that connect is trying to keep).
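For illustration, one possible shape (a sketch only; the method name matches the module-level function above, and the body is unchanged apart from being moved into the class):
```python
class SparkConnectClient:
    ...

    def _build_call_stack_trace(self) -> Optional[any_pb2.Any]:
        """Build the call stack trace extension for outgoing requests."""
        # Same logic as the module-level helpers, just scoped to the client.
        ...
```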
##########
python/pyspark/sql/connect/client/core.py:
##########
@@ -607,6 +614,55 @@ def fromProto(cls, pb: pb2.ConfigResponse) -> "ConfigResult":
)
+def _is_pyspark_source(filename: str) -> bool:
+    """Check if the given filename is from the pyspark package."""
+    return filename.startswith(PYSPARK_ROOT)
+
+
+def _retrieve_stack_frames() -> List[CallSite]:
+    """
+    Return a list of CallSites representing the relevant stack frames in the callstack.
+    """
+    frames = traceback.extract_stack()
+
+    filtered_stack_frames = []
+    for i, frame in enumerate(frames):
+        filename, lineno, func, _ = frame
+        if _is_pyspark_source(filename):
+            # Do not include PySpark internal frames as they are not user application code
+            break
+        if i + 1 < len(frames):
+            _, _, func, _ = frames[i + 1]
+        filtered_stack_frames.append(CallSite(function=func, file=filename, linenum=lineno))
Review Comment:
I know this is what `first_spark_call` does, but I think this is wrong here.
In the definition of `StackTraceElement`, the fields are:
```python
method_name: builtins.str
"""The name of the method containing the execution point."""
file_name: builtins.str
"""The name of the file containing the execution point."""
line_number: builtins.int
"""The line number of the source line containing the execution
point."""
```
`method_name` should be the method/function that contains the execution point, not the callee of that execution point, so I don't think you should use the func from the next frame.
If you get rid of this logic, the function becomes so trivial that you shouldn't need a separate function at all. You are iterating through frames to build a list of `CallSite` and then immediately unpacking it in the caller. I think you should just do everything in a single function.
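For illustration, the collection logic collapses to a short loop that could live directly in `_build_call_stack_trace` (a sketch only; it reuses `traceback`, `CallSite`, and `PYSPARK_ROOT` from the diff above, and the proto conversion is left out since it is not shown in this hunk):
```python
call_sites = []
for filename, lineno, func, _ in traceback.extract_stack():
    if filename.startswith(PYSPARK_ROOT):
        # Do not include PySpark internal frames as they are not user application code
        break
    # Use the function that contains the execution point, not the callee from the next frame
    call_sites.append(CallSite(function=func, file=filename, linenum=lineno))
```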
##########
python/pyspark/sql/tests/connect/client/test_client_call_stack_trace.py:
##########
@@ -0,0 +1,487 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import os
+import unittest
+from unittest.mock import patch
+
+import pyspark
+from pyspark.testing.connectutils import should_test_connect, connect_requirement_message
+
+if should_test_connect:
+ import pyspark.sql.connect.proto as pb2
+ from pyspark.sql.connect.client import SparkConnectClient, core
+ from pyspark.sql.connect.client.core import (
+ _is_pyspark_source,
+ _retrieve_stack_frames,
+ _build_call_stack_trace,
+ )
+
+    # The _cleanup_ml_cache invocation will hang in this test (no valid spark cluster)
+    # and it blocks the test process exiting because it is registered as the atexit handler
+    # in `SparkConnectClient` constructor. To bypass the issue, patch the method in the test.
+ SparkConnectClient._cleanup_ml_cache = lambda _: None
+
+# SPARK-54314: Improve Server-Side debuggability in Spark Connect by capturing client application's
+# file name and line numbers in PySpark
+# https://issues.apache.org/jira/browse/SPARK-54314
+
+
+@unittest.skipIf(not should_test_connect, connect_requirement_message)
Review Comment:
When I looked at this code, the first thing that came to my mind was: is this generated by an LLM? Then I went back to the PR description and confirmed it. I'll try to explain why I don't enjoy this piece.
This is a very large test case testing a very small function. The test itself is a few times larger than the function it's testing. A lot of the stuff it tests is trivial - when you read the actual test you'll find yourself asking: why am I testing this?
People might think it's harmless to have more tests. Yes, tests are good, but only good tests are good. Tests are code, and any extra code increases the effort to maintain. I think we should greatly reduce the number of tests and test what really matters - the real potential dark corners instead of artificial ones. For example, I don't think any human would write 3 test methods to test `_is_pyspark_source`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]