zhengruifeng commented on code in PR #40607:
URL: https://github.com/apache/spark/pull/40607#discussion_r1159434752
##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectStreamHandler.scala:
##########
@@ -52,13 +52,18 @@ class SparkConnectStreamHandler(responseObserver: StreamObserver[ExecutePlanResp
session.withActive {
// Add debug information to the query execution so that the jobs are traceable.
- val debugString = v.toString
- session.sparkContext.setLocalProperty(
- "callSite.short",
- s"Spark Connect - ${StringUtils.abbreviate(debugString, 128)}")
- session.sparkContext.setLocalProperty(
- "callSite.long",
- StringUtils.abbreviate(debugString, 2048))
+ try {
Review Comment:
```
Started distributed training with 2 executor processes
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
at java.lang.StringBuilder.toString(StringBuilder.java:407)
at org.sparkproject.connect.protobuf.TextFormatEscaper.escapeBytes(TextFormatEscaper.java:112)
at org.sparkproject.connect.protobuf.TextFormatEscaper.escapeBytes(TextFormatEscaper.java:119)
at org.sparkproject.connect.protobuf.TextFormat.escapeBytes(TextFormat.java:2364)
at org.sparkproject.connect.protobuf.TextFormat$Printer.printFieldValue(TextFormat.java:593)
at org.sparkproject.connect.protobuf.TextFormat$Printer.printSingleField(TextFormat.java:752)
at org.sparkproject.connect.protobuf.TextFormat$Printer.printField(TextFormat.java:457)
at org.sparkproject.connect.protobuf.TextFormat$Printer.printMessage(TextFormat.java:714)
at org.sparkproject.connect.protobuf.TextFormat$Printer.print(TextFormat.java:367)
at org.sparkproject.connect.protobuf.TextFormat$Printer.printFieldValue(TextFormat.java:606)
at org.sparkproject.connect.protobuf.TextFormat$Printer.printSingleField(TextFormat.java:752)
at org.sparkproject.connect.protobuf.TextFormat$Printer.printField(TextFormat.java:457)
at org.sparkproject.connect.protobuf.TextFormat$Printer.printMessage(TextFormat.java:714)
at org.sparkproject.connect.protobuf.TextFormat$Printer.print(TextFormat.java:367)
at org.sparkproject.connect.protobuf.TextFormat$Printer.printFieldValue(TextFormat.java:606)
at org.sparkproject.connect.protobuf.TextFormat$Printer.printSingleField(TextFormat.java:752)
at org.sparkproject.connect.protobuf.TextFormat$Printer.printField(TextFormat.java:457)
at org.sparkproject.connect.protobuf.TextFormat$Printer.printMessage(TextFormat.java:714)
at org.sparkproject.connect.protobuf.TextFormat$Printer.print(TextFormat.java:367)
at org.sparkproject.connect.protobuf.TextFormat$Printer.printFieldValue(TextFormat.java:606)
at org.sparkproject.connect.protobuf.TextFormat$Printer.printSingleField(TextFormat.java:752)
at org.sparkproject.connect.protobuf.TextFormat$Printer.printField(TextFormat.java:457)
at org.sparkproject.connect.protobuf.TextFormat$Printer.printMessage(TextFormat.java:714)
at org.sparkproject.connect.protobuf.TextFormat$Printer.print(TextFormat.java:367)
at org.sparkproject.connect.protobuf.TextFormat$Printer.printFieldValue(TextFormat.java:606)
at org.sparkproject.connect.protobuf.TextFormat$Printer.printSingleField(TextFormat.java:752)
at org.sparkproject.connect.protobuf.TextFormat$Printer.printField(TextFormat.java:457)
at org.sparkproject.connect.protobuf.TextFormat$Printer.printMessage(TextFormat.java:714)
at org.sparkproject.connect.protobuf.TextFormat$Printer.print(TextFormat.java:367)
at org.sparkproject.connect.protobuf.TextFormat$Printer.printFieldValue(TextFormat.java:606)
Extracting /home/ruifeng.zheng/spark/python/target/50b3d81f-67a9-46de-9beb-59b733c16e54/tmp72g322ys/MNIST/raw/t10k-labels-idx1-ubyte.gz to /home/ruifeng.zheng/spark/python/target/50b3d81f-67a9-46de-9beb-59b733c16e54/tmp72g322ys/MNIST/raw
```
`v.toString` keeps triggering this OOM in
`TorchDistributorDistributedUnitTestsOnConnect`.
The OOM appears to depend on the OS or Java version: it is thrown on
Linux + Java 8, but does not occur in my local environment (macOS + Java 11).
The GA resources available for free usage are limited to 2U 6G (confirmed with
@Yikun), and I believe we cannot allocate enough driver memory for distributed
PyTorch training without this fix.
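
To make the idea concrete, here is a minimal, self-contained sketch of guarding the debug-string step. It is not the code in this PR: the hunk above only shows the opening `try {`, so the `catch` clause here is an assumption, and `DebugStringSketch`, `abbreviate`, and `callSiteProperties` are hypothetical names. The `abbreviate` helper only imitates the truncation done by `StringUtils.abbreviate`, and the rendering step is passed in as a function so the example runs without protobuf or Spark on the classpath.

```scala
object DebugStringSketch {

  // Hypothetical stand-in for StringUtils.abbreviate: keeps at most
  // maxWidth characters and ends with "..." when the text was shortened.
  def abbreviate(text: String, maxWidth: Int): String = {
    require(maxWidth >= 4, "maxWidth must be at least 4")
    if (text.length <= maxWidth) text else text.take(maxWidth - 3) + "..."
  }

  // Builds the two call-site properties from a rendering function
  // (standing in for `v.toString`). If rendering fails for any reason,
  // including an OutOfMemoryError, no debug properties are produced and
  // the caller can carry on. Catching Throwable is an assumption of this
  // sketch, not something the diff hunk confirms.
  def callSiteProperties(render: () => String): Map[String, String] =
    try {
      val debugString = render()
      Map(
        "callSite.short" -> s"Spark Connect - ${abbreviate(debugString, 128)}",
        "callSite.long" -> abbreviate(debugString, 2048))
    } catch {
      case _: Throwable => Map.empty[String, String]
    }
}
```

In this sketch the debug information is treated as best effort: a failure while rendering a very large plan drops the properties instead of failing the request. Note that the full string is still materialized before truncation, so this limits the damage from the OOM rather than preventing the allocation.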
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]