zhengruifeng commented on code in PR #40607:
URL: https://github.com/apache/spark/pull/40607#discussion_r1159434752


##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala:
##########
@@ -52,13 +52,18 @@ class SparkConnectStreamHandler(responseObserver: StreamObserver[ExecutePlanResp
     session.withActive {
 
       // Add debug information to the query execution so that the jobs are traceable.
-      val debugString = v.toString
-      session.sparkContext.setLocalProperty(
-        "callSite.short",
-        s"Spark Connect - ${StringUtils.abbreviate(debugString, 128)}")
-      session.sparkContext.setLocalProperty(
-        "callSite.long",
-        StringUtils.abbreviate(debugString, 2048))
+      try {

Review Comment:
   ```
   Started distributed training with 2 executor processes
   java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOfRange(Arrays.java:3664)
        at java.lang.String.<init>(String.java:207)
        at java.lang.StringBuilder.toString(StringBuilder.java:407)
        at org.sparkproject.connect.protobuf.TextFormatEscaper.escapeBytes(TextFormatEscaper.java:112)
        at org.sparkproject.connect.protobuf.TextFormatEscaper.escapeBytes(TextFormatEscaper.java:119)
        at org.sparkproject.connect.protobuf.TextFormat.escapeBytes(TextFormat.java:2364)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printFieldValue(TextFormat.java:593)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printSingleField(TextFormat.java:752)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printField(TextFormat.java:457)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printMessage(TextFormat.java:714)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.print(TextFormat.java:367)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printFieldValue(TextFormat.java:606)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printSingleField(TextFormat.java:752)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printField(TextFormat.java:457)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printMessage(TextFormat.java:714)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.print(TextFormat.java:367)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printFieldValue(TextFormat.java:606)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printSingleField(TextFormat.java:752)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printField(TextFormat.java:457)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printMessage(TextFormat.java:714)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.print(TextFormat.java:367)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printFieldValue(TextFormat.java:606)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printSingleField(TextFormat.java:752)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printField(TextFormat.java:457)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printMessage(TextFormat.java:714)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.print(TextFormat.java:367)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printFieldValue(TextFormat.java:606)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printSingleField(TextFormat.java:752)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printField(TextFormat.java:457)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printMessage(TextFormat.java:714)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.print(TextFormat.java:367)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printFieldValue(TextFormat.java:606)
   Extracting /home/ruifeng.zheng/spark/python/target/50b3d81f-67a9-46de-9beb-59b733c16e54/tmp72g322ys/MNIST/raw/t10k-labels-idx1-ubyte.gz to /home/ruifeng.zheng/spark/python/target/50b3d81f-67a9-46de-9beb-59b733c16e54/tmp72g322ys/MNIST/raw
   ```
   
   `v.toString` keeps running out of memory in `TorchDistributorDistributedUnitTestsOnConnect`.
   The OOM appears to depend on the OS or Java version: it is thrown on Linux + Java 8 but does not occur in my local environment (macOS + Java 11).
   
   
   The free GitHub Actions (GA) resources are limited to 2 CPUs and 6 GB of memory (confirmed with @Yikun), so I cannot allocate enough driver memory for distributed PyTorch training without this fix.
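
To illustrate the idea, here is a minimal, Spark-free sketch of guarding the debug-string step so that a failure while rendering the plan does not fail the request. The names `DebugStringSketch`, `abbreviate`, and `safeDebugString` are hypothetical, `abbreviate` is only a simplified stand-in for `StringUtils.abbreviate`, and the `catch` clause is an assumption, since the hunk above shows only the opening `try {`. Note that abbreviating after the fact does not reduce peak memory: the stack trace shows the full text being built by `TextFormat` before any truncation can happen.

```scala
object DebugStringSketch {

  /** Simplified stand-in for StringUtils.abbreviate: cut to maxLen, ending in "...". */
  def abbreviate(s: String, maxLen: Int): String = {
    require(maxLen >= 4, "maxLen must leave room for the ellipsis")
    if (s.length <= maxLen) s else s.substring(0, maxLen - 3) + "..."
  }

  /**
   * Evaluates `render` lazily (by-name) and abbreviates the result.
   * Returns None if rendering throws anything, including OutOfMemoryError.
   * Catching Throwable is an assumption made for this sketch; recovering
   * from a real OOM is best effort and not guaranteed by the JVM.
   */
  def safeDebugString(render: => String, maxLen: Int): Option[String] =
    try Some(abbreviate(render, maxLen))
    catch {
      case _: Throwable => None
    }
}
```

With this shape, a caller would set `callSite.short` and `callSite.long` only when `safeDebugString` returns `Some(...)`, and skip the debug properties otherwise, for example `DebugStringSketch.safeDebugString(v.toString, 128)`.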



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

