AngersZhuuuu opened a new pull request, #52919:
URL: https://github.com/apache/spark/pull/52919

   … stuck and stuck stop process
   
   
   
   ### What changes were proposed in this pull request?
   SparkContext stop stuck on ContextCleaner
   
   ```
   25/11/05 18:12:29 ERROR [shutdown-hook-0] ThreadUtils: 
   14 Driver BLOCKED Blocked by Thread 60 
Lock(org.apache.spark.ContextCleaner@1726738661})
     org.apache.spark.ContextCleaner.stop(ContextCleaner.scala:145)
     org.apache.spark.SparkContext.$anonfun$stop$9(SparkContext.scala:2094)
     
org.apache.spark.SparkContext.$anonfun$stop$9$adapted(SparkContext.scala:2094)
     org.apache.spark.SparkContext$$Lambda$5309/807013918.apply(Unknown Source)
     scala.Option.foreach(Option.scala:407)
     org.apache.spark.SparkContext.$anonfun$stop$8(SparkContext.scala:2094)
     org.apache.spark.SparkContext$$Lambda$5308/1445921225.apply$mcV$sp(Unknown 
Source)
     org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1512)
     org.apache.spark.SparkContext.stop(SparkContext.scala:2094)
     org.apache.spark.SparkContext.stop(SparkContext.scala:2050)
     org.apache.spark.sql.SparkSession.stop(SparkSession.scala:718)
     com.shopee.data.content.ods.live_performance.Main$.main(Main.scala:62)
     com.shopee.data.content.ods.live_performance.Main.main(Main.scala)
     sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
     
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
     java.lang.reflect.Method.invoke(Method.java:498)
     
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:751)
 
   ```
   ContextCleaner stop() will wait lock
   ```
   def stop(): Unit = {
     stopped = true
     // Interrupt the cleaning thread, but wait until the current task has 
finished before
     // doing so. This guards against the race condition where a cleaning 
thread may
     // potentially clean similarly named variables created by a different 
SparkContext,
     // resulting in otherwise inexplicable block-not-found exceptions 
(SPARK-6132).
     synchronized {
       cleaningThread.interrupt()
     }
     cleaningThread.join()
     periodicGCService.shutdown()
   }
    ```
   
   , but one call on keepCleaning() hold the lock
   
   ```
   25/11/05 18:12:29 ERROR [shutdown-hook-0] ThreadUtils: 
   60 Spark Context Cleaner TIMED_WAITING 
Monitor(org.apache.spark.ContextCleaner@1726738661})
     sun.misc.Unsafe.park(Native Method)
     java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
     
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
     
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
     scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:248)
     scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:258)
     scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263)
     org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:294)
     org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
     
org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:194)
     
org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:351)
     
org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
     
org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:78)
     
org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:254)
     
org.apache.spark.ContextCleaner.$anonfun$keepCleaning$3(ContextCleaner.scala:204)
     
org.apache.spark.ContextCleaner.$anonfun$keepCleaning$3$adapted(ContextCleaner.scala:195)
     org.apache.spark.ContextCleaner$$Lambda$1178/1994584033.apply(Unknown 
Source)
     scala.Option.foreach(Option.scala:407)
     
org.apache.spark.ContextCleaner.$anonfun$keepCleaning$1(ContextCleaner.scala:195)
 => holding Monitor(org.apache.spark.ContextCleaner@1726738661})
     
org.apache.spark.ContextCleaner$$Lambda$1109/1496842179.apply$mcV$sp(Unknown 
Source)
     org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1474)
     
org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:189)
     org.apache.spark.ContextCleaner$$anon$1.run(ContextCleaner.scala:79) 
   ```
   BlockManager stuck on removeBroadcast 
RpcUtils.INFINITE_TIMEOUT.awaitResult(future) 【PR 
https://github.com/apache/spark/pull/28924 change here】
   ```
   def removeBroadcast(broadcastId: Long, removeFromMaster: Boolean, blocking: 
Boolean): Unit = {
     val future = driverEndpoint.askSync[Future[Seq[Int]]](
       RemoveBroadcast(broadcastId, removeFromMaster))
     future.failed.foreach(e =>
       logWarning(s"Failed to remove broadcast $broadcastId" +
         s" with removeFromMaster = $removeFromMaster - ${e.getMessage}", e)
     )(ThreadUtils.sameThread)
     if (blocking) {
       // the underlying Futures will timeout anyway, so it's safe to use 
infinite timeout here
       RpcUtils.INFINITE_TIMEOUT.awaitResult(future)
     }
   } 
   ```
   For such case only reason should be RPC was missing handling
   
   Driver OOM or A thread leak in yarn nm prevents the creation of new threads 
to handle RPC.
   ```
   25/11/05 08:16:22 ERROR [metrics-paimon-push-gateway-reporter-2-thread-1] 
ScheduledReporter: Exception thrown from PushGatewayReporter#report. Exception 
was suppressed.
   java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:717)
        at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1115)
        at 
sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1388)
        at 
sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1416)
        at 
sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1400)
        at 
sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559)
        at 
sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185)
        at 
sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:167)
        at 
io.prometheus.client.exporter.PushGateway.doRequest(PushGateway.java:243)
        at io.prometheus.client.exporter.PushGateway.push(PushGateway.java:134)
        at 
org.apache.paimon.metrics.reporter.PushGatewayReporter.report(PushGatewayReporter.java:84)
        at 
com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:253)
        at 
com.codahale.metrics.ScheduledReporter.lambda$start$0(ScheduledReporter.java:182)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748) 
   ```
   
   
   
   then whole process stuck on here
   
   
   ### Why are the changes needed?
   Avoid app stuck
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to