zhengruifeng commented on code in PR #40607:
URL: https://github.com/apache/spark/pull/40607#discussion_r1159434752


##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala:
##########
@@ -52,13 +52,18 @@ class SparkConnectStreamHandler(responseObserver: StreamObserver[ExecutePlanResp
     session.withActive {
 
       // Add debug information to the query execution so that the jobs are traceable.
-      val debugString = v.toString
-      session.sparkContext.setLocalProperty(
-        "callSite.short",
-        s"Spark Connect - ${StringUtils.abbreviate(debugString, 128)}")
-      session.sparkContext.setLocalProperty(
-        "callSite.long",
-        StringUtils.abbreviate(debugString, 2048))
+      try {

Review Comment:
   ```
   Started distributed training with 2 executor processes
   java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOfRange(Arrays.java:3664)
        at java.lang.String.<init>(String.java:207)
        at java.lang.StringBuilder.toString(StringBuilder.java:407)
        at org.sparkproject.connect.protobuf.TextFormatEscaper.escapeBytes(TextFormatEscaper.java:112)
        at org.sparkproject.connect.protobuf.TextFormatEscaper.escapeBytes(TextFormatEscaper.java:119)
        at org.sparkproject.connect.protobuf.TextFormat.escapeBytes(TextFormat.java:2364)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printFieldValue(TextFormat.java:593)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printSingleField(TextFormat.java:752)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printField(TextFormat.java:457)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printMessage(TextFormat.java:714)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.print(TextFormat.java:367)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printFieldValue(TextFormat.java:606)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printSingleField(TextFormat.java:752)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printField(TextFormat.java:457)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printMessage(TextFormat.java:714)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.print(TextFormat.java:367)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printFieldValue(TextFormat.java:606)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printSingleField(TextFormat.java:752)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printField(TextFormat.java:457)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printMessage(TextFormat.java:714)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.print(TextFormat.java:367)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printFieldValue(TextFormat.java:606)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printSingleField(TextFormat.java:752)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printField(TextFormat.java:457)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printMessage(TextFormat.java:714)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.print(TextFormat.java:367)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printFieldValue(TextFormat.java:606)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printSingleField(TextFormat.java:752)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printField(TextFormat.java:457)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printMessage(TextFormat.java:714)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.print(TextFormat.java:367)
        at org.sparkproject.connect.protobuf.TextFormat$Printer.printFieldValue(TextFormat.java:606)
   Extracting /home/ruifeng.zheng/spark/python/target/50b3d81f-67a9-46de-9beb-59b733c16e54/tmp72g322ys/MNIST/raw/t10k-labels-idx1-ubyte.gz to /home/ruifeng.zheng/spark/python/target/50b3d81f-67a9-46de-9beb-59b733c16e54/tmp72g322ys/MNIST/raw
   ```
   
   `v.toString` keeps throwing OOM in `TorchDistributorDistributedUnitTestsOnConnect`.
   The OOM seems to be related to the Java version: it was thrown on both Linux+Java8 and MacOS+Java8, but does not occur on MacOS+Java11.
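
   One possible contributor to the Java 8 vs. Java 11 difference (an assumption, not something confirmed in this thread): the text format escapes each non-printable byte as a backslash plus three octal digits, and Java 8 strings store 2 bytes per character, while Java 9+ compact strings store 1 byte per character for ASCII-only text. A rough back-of-the-envelope sketch, where `DebugStringSizeEstimate` and its members are hypothetical names used only for illustration:

   ```scala
   object DebugStringSizeEstimate {
     // Worst case for protobuf text format: a non-printable byte is printed as
     // a backslash followed by three octal digits, i.e. 4 characters per byte.
     val MaxCharsPerEscapedByte: Long = 4L

     /** Upper bound on the characters needed to print `payloadBytes` of binary data. */
     def worstCaseChars(payloadBytes: Long): Long =
       payloadBytes * MaxCharsPerEscapedByte

     /**
      * Rough size of the resulting String's backing array, in bytes.
      * Use bytesPerChar = 2 for a UTF-16 char[] (Java 8) and
      * bytesPerChar = 1 for a Latin-1 compact string (Java 9+).
      */
     def approxStringHeapBytes(payloadBytes: Long, bytesPerChar: Int): Long =
       worstCaseChars(payloadBytes) * bytesPerChar
   }
   ```

   For a hypothetical 64 MiB binary payload this gives an upper bound of 512 MiB at 2 bytes per character and 256 MiB at 1 byte per character, before counting the transient copy made in `StringBuilder.toString` (the top frames of the trace above).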
   
   
   The free GitHub Actions (GA) runners are limited to 2 CPU cores and 6 GB of memory (2U 6G; confirmed with @Yikun), and I believe we cannot allocate enough driver memory for this distributed PyTorch training UT without this fix.
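
   The diff above opens a `try` block around the debug-string code; the rest of the change is not shown here, so the following is only a minimal sketch of the general idea, not the actual patch. `CallSiteDebugInfo`, `abbreviate` (a small stand-in for `StringUtils.abbreviate`), and `safeDebugString` are hypothetical names:

   ```scala
   import scala.util.control.NonFatal

   object CallSiteDebugInfo {
     /** Returns `s` if it fits, otherwise a prefix ending in "..." with total length `maxLen`. */
     def abbreviate(s: String, maxLen: Int): String = {
       require(maxLen >= 4, "maxLen must leave room for the ellipsis")
       if (s.length <= maxLen) s else s.take(maxLen - 3) + "..."
     }

     /**
      * Evaluates `render` lazily and abbreviates the result.
      * Returns None if rendering fails, so the caller can skip the debug
      * information instead of failing the whole request.
      */
     def safeDebugString(render: => String, maxLen: Int): Option[String] =
       try Some(abbreviate(render, maxLen))
       catch {
         // NonFatal does not match OutOfMemoryError, so it is handled explicitly.
         // Assumption: the failed allocation was the huge debug string itself,
         // so the JVM is still usable once that attempt is abandoned.
         case _: OutOfMemoryError => None
         case NonFatal(_) => None
       }
   }
   ```

   A caller could then set `callSite.short` and `callSite.long` only when `safeDebugString(v.toString, 128)` and `safeDebugString(v.toString, 2048)` return `Some(...)`. Note that this does not avoid the large allocation attempt; it only keeps a failure in debug output from failing the query.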


