LantaoJin opened a new pull request #29558: URL: https://github.com/apache/spark/pull/29558
### What changes were proposed in this pull request?

Remove all pieces of a broadcast when storing some of its pieces fails, to avoid a memory leak.

In `TorrentBroadcast.scala`:

```scala
L133: if (!blockManager.putSingle(broadcastId, value, MEMORY_AND_DISK, tellMaster = false))
L137: TorrentBroadcast.blockifyObject(value, blockSize, SparkEnv.get.serializer, compressionCodec)
L147: if (!blockManager.putBytes(pieceId, bytes, MEMORY_AND_DISK_SER, tellMaster = true))
```

If the original value is saved successfully (`TorrentBroadcast.scala`: L133) but the subsequent `blockifyObject()` call (L137) or a piece store (L147) fails, there is no opportunity to release the broadcast from memory.

### Why are the changes needed?

We use the Spark Thrift server as a long-running service. A bad query submitted a heavy `BroadcastNestedLoopJoin` operation and pushed the driver into full GC. We killed the bad query, but the driver's memory usage stayed high and full GCs remained very frequent. Investigating the GC dump and log suggests that the broadcast may leak memory.

    2020-08-19T18:54:02.824-0700: [Full GC (Allocation Failure)
    2020-08-19T18:54:02.824-0700: [Class Histogram (before full gc):
    116G->112G(170G), 184.9121920 secs]
    [Eden: 32.0M(7616.0M)->0.0B(8704.0M) Survivors: 1088.0M->0.0B Heap: 116.4G(170.0G)->112.9G(170.0G)], [Metaspace: 177285K->177270K(182272K)]

     num     #instances         #bytes  class name
    ----------------------------------------------
       1:     676531691    72035438432  [B
       2:     676502528    32472121344  org.apache.spark.sql.catalyst.expressions.UnsafeRow
       3:         99551    12018117568  [Ljava.lang.Object;
       4:         26570     4349629040  [I
       5:             6     3264536688  [Lorg.apache.spark.sql.catalyst.InternalRow;
       6:       1708819      256299456  [C
       7:          2338      179615208  [J
       8:       1703669       54517408  java.lang.String
       9:        103860       34896960  org.apache.spark.status.TaskDataWrapper
      10:        177396       25545024  java.net.URI
     ...

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually tested.
This UT is hard to write and the patch is straightforward.
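
For readers who want to see the failure mode in isolation, here is a minimal sketch of the cleanup-on-failure pattern this PR describes. It is not the PR's diff and does not use Spark: `ToyBlockStore`, `BroadcastCleanupSketch`, `writeBlocks`, and the `_piece` id suffix are hypothetical names made up for illustration, and the store is just a size-limited in-memory map. The point is that once the whole value has been stored, any later failure has to remove everything already written for that broadcast before the error propagates.

```scala
import scala.collection.mutable
import scala.util.control.NonFatal

// Hypothetical stand-in for a block store; this is NOT Spark's BlockManager.
final class ToyBlockStore(capacityBytes: Long) {
  private val blocks = mutable.Map.empty[String, Array[Byte]]

  def usedBytes: Long = blocks.valuesIterator.map(_.length.toLong).sum

  def blockIds: Set[String] = blocks.keySet.toSet

  // Returns false when the block does not fit, modelling a failed put.
  def put(id: String, bytes: Array[Byte]): Boolean =
    if (usedBytes + bytes.length > capacityBytes) false
    else { blocks(id) = bytes; true }

  // Drops the whole-value block and every piece that belongs to the broadcast.
  def removeBroadcast(broadcastId: String): Unit = {
    val doomed = blocks.keys
      .filter(k => k == broadcastId || k.startsWith(broadcastId + "_piece"))
      .toList
    doomed.foreach(id => blocks.remove(id))
  }
}

object BroadcastCleanupSketch {
  // Stores the whole value, then each piece. On any non-fatal failure,
  // removes everything already stored for this broadcast and rethrows.
  def writeBlocks(
      store: ToyBlockStore,
      broadcastId: String,
      value: Array[Byte],
      pieceSize: Int): Int = {
    try {
      if (!store.put(broadcastId, value)) {
        throw new IllegalStateException(s"Failed to store $broadcastId")
      }
      val pieces = value.grouped(pieceSize).toList
      pieces.zipWithIndex.foreach { case (piece, i) =>
        val pieceId = s"${broadcastId}_piece$i"
        if (!store.put(pieceId, piece)) {
          throw new IllegalStateException(s"Failed to store $pieceId of $broadcastId")
        }
      }
      pieces.length
    } catch {
      case NonFatal(e) =>
        // Without this call, the value and any pieces stored so far would stay resident.
        store.removeBroadcast(broadcastId)
        throw e
    }
  }
}
```

Without the `catch` branch, a failure on a later piece would leave the whole value and the earlier pieces in the store with nothing left to release them, which is the shape of the leak described above.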
