gaogaotiantian commented on code in PR #53076:
URL: https://github.com/apache/spark/pull/53076#discussion_r2628307934
##########
python/pyspark/sql/connect/client/core.py:
##########
@@ -607,6 +614,55 @@ def fromProto(cls, pb: pb2.ConfigResponse) -> "ConfigResult":
)
+def _is_pyspark_source(filename: str) -> bool:
Review Comment:
I don't think we need this to be a function. It's a single line of code.
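For example, the check could be inlined directly in the loop that uses it (a sketch against the code in this hunk, reusing the existing `PYSPARK_ROOT` constant):
```python
# Inside the frame-collection loop, instead of calling a helper:
if filename.startswith(PYSPARK_ROOT):
    # Do not include PySpark internal frames as they are not user application code
    break
```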
##########
python/pyspark/sql/connect/client/core.py:
##########
@@ -1273,6 +1329,10 @@ def _execute_plan_request_with_metadata(
)
req.operation_id = operation_id
self._update_request_with_user_context_extensions(req)
+
+ call_stack_trace = _build_call_stack_trace()
+ if call_stack_trace:
Review Comment:
The minimum supported version is 3.10 now, so you can do
```python
if call_stack_trace := _build_call_stack_trace():
req.user_context.extensions.append(call_stack_trace)
```
##########
python/pyspark/sql/connect/client/core.py:
##########
@@ -607,6 +614,55 @@ def fromProto(cls, pb: pb2.ConfigResponse) -> "ConfigResult":
)
+def _is_pyspark_source(filename: str) -> bool:
+    """Check if the given filename is from the pyspark package."""
+    return filename.startswith(PYSPARK_ROOT)
+
+
+def _retrieve_stack_frames() -> List[CallSite]:
+    """
+    Return a list of CallSites representing the relevant stack frames in the callstack.
+    """
+    frames = traceback.extract_stack()
+
+    filtered_stack_frames = []
+    for i, frame in enumerate(frames):
+        filename, lineno, func, _ = frame
+        if _is_pyspark_source(filename):
+            # Do not include PySpark internal frames as they are not user application code
+            break
+        if i + 1 < len(frames):
+            _, _, func, _ = frames[i + 1]
+        filtered_stack_frames.append(CallSite(function=func, file=filename, linenum=lineno))
+
+    return filtered_stack_frames
+
+
+def _build_call_stack_trace() -> Optional[any_pb2.Any]:
Review Comment:
These functions are exclusively used by `SparkConnectClient`, and they provide information about `SparkConnectClient`. We should put them in the class instead of defining individual functions at the module level (I also believe this is a good pattern that connect is trying to keep).
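For illustration, one possible shape (a sketch only; the method name matches the module-level function above, and the body is unchanged apart from being moved into the class):
```python
class SparkConnectClient:
    ...

    def _build_call_stack_trace(self) -> Optional[any_pb2.Any]:
        """Build the call stack trace extension for outgoing requests."""
        # Same logic as the module-level helpers, just scoped to the client.
        ...
```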
##########
python/pyspark/sql/connect/client/core.py:
##########
@@ -607,6 +614,55 @@ def fromProto(cls, pb: pb2.ConfigResponse) -> "ConfigResult":
)
+def _is_pyspark_source(filename: str) -> bool:
+    """Check if the given filename is from the pyspark package."""
+    return filename.startswith(PYSPARK_ROOT)
+
+
+def _retrieve_stack_frames() -> List[CallSite]:
+    """
+    Return a list of CallSites representing the relevant stack frames in the callstack.
+    """
+    frames = traceback.extract_stack()
+
+    filtered_stack_frames = []
+    for i, frame in enumerate(frames):
+        filename, lineno, func, _ = frame
+        if _is_pyspark_source(filename):
+            # Do not include PySpark internal frames as they are not user application code
+            break
+        if i + 1 < len(frames):
+            _, _, func, _ = frames[i + 1]
+        filtered_stack_frames.append(CallSite(function=func, file=filename, linenum=lineno))
Review Comment:
I know this is what `first_spark_call` does, but I think this is wrong here.
In the definition of `StackTraceElement`, the fields are:
```python
method_name: builtins.str
"""The name of the method containing the execution point."""
file_name: builtins.str
"""The name of the file containing the execution point."""
line_number: builtins.int
"""The line number of the source line containing the execution
point."""
```
`method_name` should be the method/function that contains the execution point, not the callee of that execution point, so I don't think you should use the func from the next frame.
If you get rid of this logic, the function becomes so trivial that you shouldn't need a separate function at all. You are iterating through frames to build a list of `CallSite` and then immediately unpacking it in the caller. I think you should just do everything in a single function.
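For illustration, the collection logic collapses to a short loop that could live directly in `_build_call_stack_trace` (a sketch only; it reuses `traceback`, `CallSite`, and `PYSPARK_ROOT` from the diff above, and the proto conversion is left out since it is not shown in this hunk):
```python
call_sites = []
for filename, lineno, func, _ in traceback.extract_stack():
    if filename.startswith(PYSPARK_ROOT):
        # Do not include PySpark internal frames as they are not user application code
        break
    # Use the function that contains the execution point, not the callee from the next frame
    call_sites.append(CallSite(function=func, file=filename, linenum=lineno))
```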
##########
python/pyspark/sql/tests/connect/client/test_client_call_stack_trace.py:
##########
@@ -0,0 +1,487 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import os
+import unittest
+from unittest.mock import patch
+
+import pyspark
+from pyspark.testing.connectutils import should_test_connect, connect_requirement_message
+
+if should_test_connect:
+ import pyspark.sql.connect.proto as pb2
+ from pyspark.sql.connect.client import SparkConnectClient, core
+ from pyspark.sql.connect.client.core import (
+ _is_pyspark_source,
+ _retrieve_stack_frames,
+ _build_call_stack_trace,
+ )
+
+    # The _cleanup_ml_cache invocation will hang in this test (no valid spark cluster)
+    # and it blocks the test process exiting because it is registered as the atexit handler
+    # in `SparkConnectClient` constructor. To bypass the issue, patch the method in the test.
+ SparkConnectClient._cleanup_ml_cache = lambda _: None
+
+# SPARK-54314: Improve Server-Side debuggability in Spark Connect by capturing client application's
+# file name and line numbers in PySpark
+# https://issues.apache.org/jira/browse/SPARK-54314
+
+
+@unittest.skipIf(not should_test_connect, connect_requirement_message)
Review Comment:
When I looked at this code, the first thing that came to my mind was: is this generated by an LLM? Then I went back to the PR description and confirmed it. I'll try to explain why I don't enjoy this piece.
This is a very large test case testing a very small function. The test itself is a few times larger than the function it's testing. A lot of the stuff it tests is trivial - when you read the actual test you'll find yourself asking: why am I testing this?
People might think it's harmless to have more tests. Yes, tests are good, but only good tests are good. Tests are code, and any extra code increases the effort to maintain. I think we should greatly reduce the number of tests and test what really matters - the real potential dark corners instead of artificial ones. For example, I don't think any human would write 3 test methods to test `_is_pyspark_source`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]