BalaMahesh commented on issue #4230: URL: https://github.com/apache/hudi/issues/4230#issuecomment-1114515094
Hello @yihua, we are observing the same behaviour in the scenario below as well: executor pods (running on K8s) are dying, and after the maximum retry attempts for the lost shuffle blocks, the driver kills all the executor pods and then gets stuck without terminating. Because the driver pod never exits, the spark-operator does not restart the Hudi application. Will this PR also fix this scenario?

Stack traces:

```
22/04/29 19:09:36 INFO pool-32-thread-1 DAGScheduler: Job 907 failed: sum at DeltaSync.java:557, took 1588.119949 s
22/04/29 19:09:36 ERROR pool-32-thread-1 HoodieDeltaStreamer: Shutting down delta-sync due to exception
org.apache.spark.SparkException: Job aborted due to stage failure: ResultStage 1221 (sum at DeltaSync.java:557) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 189
	at org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$2(MapOutputTracker.scala:1013)
22/04/29 19:09:36 INFO pool-32-thread-1 DeltaSync: Shutting down embedded timeline server
22/04/29 19:09:36 INFO pool-32-thread-1 EmbeddedTimelineService: Closing Timeline server
22/04/29 19:09:36 INFO pool-32-thread-1 TimelineService: Closing Timeline Service
22/04/29 19:09:36 INFO pool-32-thread-1 Javalin: Stopping Javalin ...
22/04/29 19:09:36 INFO pool-32-thread-1 Javalin: Javalin has stopped
22/04/29 19:09:36 INFO main SparkUI: Stopped Spark web UI at http://spark-db4e548074c3c6b1-driver-svc.spark.svc:4040
22/04/29 19:09:36 INFO pool-32-thread-1 TimelineService: Closed Timeline Service
22/04/29 19:09:36 INFO pool-32-thread-1 EmbeddedTimelineService: Closed Timeline server
22/04/29 19:09:36 INFO main KubernetesClusterSchedulerBackend: Shutting down all executors
22/04/29 19:09:36 INFO dispatcher-CoarseGrainedScheduler KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
22/04/29 19:09:36 WARN main ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
22/04/29 19:09:40 INFO dispatcher-event-loop-0 MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/04/29 19:09:40 INFO main MemoryStore: MemoryStore cleared
22/04/29 19:09:40 INFO main BlockManager: BlockManager stopped
22/04/29 19:09:40 INFO main BlockManagerMaster: BlockManagerMaster stopped
22/04/29 19:09:40 INFO dispatcher-event-loop-0 OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/04/29 19:09:40 INFO main SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: Job aborted due to stage failure: ResultStage 1221 (sum at DeltaSync.java:557) has failed the maximum allowable number of times: 4.
Most recent failure reason: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 189
	at org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$2(MapOutputTracker.scala:1013)
	at org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$2$adapted(MapOutputTracker.scala:1009)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at org.apache.spark.MapOutputTracker$.convertMapStatuses(MapOutputTracker.scala:1009)
	at org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorId(MapOutputTracker.scala:821)
	at org.apache.spark.shuffle.sort.SortShuffleManager.getReader(SortShuffleManager.scala:133)
	at org.apache.spark.shuffle.ShuffleManager.getReader(ShuffleManager.scala:63)
	at org.apache.spark.shuffle.ShuffleManager.getReader$(ShuffleManager.scala:57)
	at org.apache.spark.shuffle.sort.SortShuffleManager.getReader(SortShuffleManager.scala:73)
	at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1440)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$1(HoodieDeltaStreamer.java:184)
	at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:179)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:514)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```

Even after all of the above logs, the Hudi application stays stuck and does not restart or process any data.
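In case it is useful to anyone hitting the same hang, below is a minimal sketch of the kind of driver-side watchdog one could wrap around the job as a stopgap, so the driver JVM is guaranteed to exit on failure and the spark-operator's restart policy can take over. Everything in it is hypothetical and not part of Hudi or Spark: the `DriverExitWatchdog` class, the `runDeltaStreamer` wrapper, and the 60-second grace period are illustrative only.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DriverExitWatchdog {

    public static void main(String[] args) {
        try {
            // Hypothetical wrapper around the real job submission,
            // e.g. delegating to HoodieDeltaStreamer.main(args).
            runDeltaStreamer(args);
        } catch (Exception e) {
            e.printStackTrace();

            // Arm a hard-exit timer on a daemon thread. If System.exit(1)
            // below hangs (e.g. a shutdown hook or a lingering non-daemon
            // thread never finishes), halt() still kills the JVM, so the
            // driver pod exits with a failure status and the operator can
            // restart the application.
            ScheduledExecutorService reaper =
                Executors.newSingleThreadScheduledExecutor(r -> {
                    Thread t = new Thread(r, "driver-exit-watchdog");
                    t.setDaemon(true); // daemon, so the watchdog itself never blocks exit
                    return t;
                });
            reaper.schedule(() -> Runtime.getRuntime().halt(1), 60, TimeUnit.SECONDS);

            System.exit(1); // normal exit path; halt() fires only if this never returns
        }
    }

    private static void runDeltaStreamer(String[] args) throws Exception {
        // Placeholder: invoke the actual HoodieDeltaStreamer job here.
    }
}
```

The `halt()` fallback is the point of the sketch: `System.exit(1)` runs shutdown hooks, and if a hook or a non-daemon thread is what keeps the driver alive after "Successfully stopped SparkContext" (as the hang above suggests), only a hard `halt()` guarantees the pod actually terminates.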
