BalaMahesh commented on issue #4230:
URL: https://github.com/apache/hudi/issues/4230#issuecomment-1114515094

   Hello @yihua,
   
   We are observing the same behaviour in the scenario below as well. Executor pods (running on Kubernetes) are dying, and after the maximum retry attempts for the lost shuffle blocks, the driver kills all the executor pods and then gets stuck without terminating. Because of this, the spark-operator does not restart the Hudi application. Will this PR also fix this scenario?
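   
   As a side note (this is just an assumption/sketch on our end, not something from this PR): the "has failed the maximum allowable number of times: 4" in the trace below is Spark's per-stage retry limit, which we could raise when submitting the delta-streamer, roughly as below; more retries would only mask the lost-shuffle problem rather than fix the stuck driver.
   
       import org.apache.spark.SparkConf;
   
       // Sketch only: raising the per-stage retry limit buys more retries on
       // MetadataFetchFailedException; it does not fix the lost shuffle output
       // caused by dying executor pods, nor the driver hanging afterwards.
       public final class StageRetryConf {
         public static SparkConf build() {
           return new SparkConf()
               .setAppName("hoodie-deltastreamer")
               .set("spark.stage.maxConsecutiveAttempts", "8"); // Spark's default is 4
         }
       }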
   
   Stack traces:
   
   22/04/29 19:09:36 INFO pool-32-thread-1 DAGScheduler: Job 907 failed: sum at DeltaSync.java:557, took 1588.119949 s
   22/04/29 19:09:36 ERROR pool-32-thread-1 HoodieDeltaStreamer: Shutting down delta-sync due to exception
   org.apache.spark.SparkException: Job aborted due to stage failure: ResultStage 1221 (sum at DeltaSync.java:557) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 189
        at org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$2(MapOutputTracker.scala:1013)
   
   22/04/29 19:09:36 INFO pool-32-thread-1 DeltaSync: Shutting down embedded timeline server
   22/04/29 19:09:36 INFO pool-32-thread-1 EmbeddedTimelineService: Closing Timeline server
   22/04/29 19:09:36 INFO pool-32-thread-1 TimelineService: Closing Timeline Service
   22/04/29 19:09:36 INFO pool-32-thread-1 Javalin: Stopping Javalin ...
   22/04/29 19:09:36 INFO pool-32-thread-1 Javalin: Javalin has stopped
   22/04/29 19:09:36 INFO main SparkUI: Stopped Spark web UI at http://spark-db4e548074c3c6b1-driver-svc.spark.svc:4040
   22/04/29 19:09:36 INFO pool-32-thread-1 TimelineService: Closed Timeline Service
   22/04/29 19:09:36 INFO pool-32-thread-1 EmbeddedTimelineService: Closed Timeline server
   22/04/29 19:09:36 INFO main KubernetesClusterSchedulerBackend: Shutting down all executors
   22/04/29 19:09:36 INFO dispatcher-CoarseGrainedScheduler KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
   22/04/29 19:09:36 WARN main ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
   22/04/29 19:09:40 INFO dispatcher-event-loop-0 MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
   22/04/29 19:09:40 INFO main MemoryStore: MemoryStore cleared
   22/04/29 19:09:40 INFO main BlockManager: BlockManager stopped
   22/04/29 19:09:40 INFO main BlockManagerMaster: BlockManagerMaster stopped
   22/04/29 19:09:40 INFO dispatcher-event-loop-0 OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
   22/04/29 19:09:40 INFO main SparkContext: Successfully stopped SparkContext
   Exception in thread "main" org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: Job aborted due to stage failure: ResultStage 1221 (sum at DeltaSync.java:557) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 189
        at org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$2(MapOutputTracker.scala:1013)
        at org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$2$adapted(MapOutputTracker.scala:1009)
        at scala.collection.Iterator.foreach(Iterator.scala:941)
        at scala.collection.Iterator.foreach$(Iterator.scala:941)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
        at org.apache.spark.MapOutputTracker$.convertMapStatuses(MapOutputTracker.scala:1009)
        at org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorId(MapOutputTracker.scala:821)
        at org.apache.spark.shuffle.sort.SortShuffleManager.getReader(SortShuffleManager.scala:133)
        at org.apache.spark.shuffle.ShuffleManager.getReader(ShuffleManager.scala:63)
        at org.apache.spark.shuffle.ShuffleManager.getReader$(ShuffleManager.scala:57)
        at org.apache.spark.shuffle.sort.SortShuffleManager.getReader(SortShuffleManager.scala:73)
        at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386)
        at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1440)
        at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
        at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$1(HoodieDeltaStreamer.java:184)
        at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
        at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:179)
        at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:514)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.base/java.lang.reflect.Method.invoke(Unknown Source)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
   
   
   Even after all of the above logs, the Hudi application stays stuck and neither restarts nor processes any data.
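   
   For completeness, a crude mitigation we have been considering on our side (purely a sketch; DeltaStreamerExitWrapper is a hypothetical wrapper class of ours, not part of Hudi) is to force the driver JVM to exit once HoodieDeltaStreamer.main returns or throws, so that leftover non-daemon threads cannot keep the driver pod alive and the spark-operator restart policy can kick in:
   
       import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer;
   
       // Hypothetical wrapper entry point: submit this class instead of
       // HoodieDeltaStreamer so the driver pod always terminates with an exit
       // code, letting the spark-operator restart the application on failure.
       public final class DeltaStreamerExitWrapper {
         public static void main(String[] args) {
           int exitCode = 0;
           try {
             HoodieDeltaStreamer.main(args); // same entry point we submit today
           } catch (Throwable t) {
             t.printStackTrace();
             exitCode = 1;
           } finally {
             // Force termination even if non-daemon threads are still running.
             System.exit(exitCode);
           }
         }
       }
   
   This would only be a workaround on the submission side; the proper fix presumably belongs in Hudi's own shutdown path, which is why we are asking whether this PR covers the scenario.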

