codope commented on issue #3431:
URL: https://github.com/apache/hudi/issues/3431#issuecomment-907163686


   @nochimow Thanks for providing the logs. For convenience, I am pasting the relevant stack trace below.
   ```
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 18 (countByKey at BaseSparkCommitActionExecutor.java:158) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: Failed to connect to /172.35.196.242:35965
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:554)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:485)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:64)
        at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:156)
        at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
        at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:153)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
        at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:153)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
        at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
        at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed to connect to /172.35.196.242:35965
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
        at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:114)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.lambda$initiateRetry$0(RetryingBlockFetcher.java:169)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
        ... 1 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /172.35.196.242:35965
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        ... 2 more
Caused by: java.net.ConnectException: Connection refused
        ... 11 more
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1493)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2107)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$countByKey$1.apply(PairRDDFunctions.scala:370)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$countByKey$1.apply(PairRDDFunctions.scala:370)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.PairRDDFunctions.countByKey(PairRDDFunctions.scala:369)
        at org.apache.spark.api.java.JavaPairRDD.countByKey(JavaPairRDD.scala:312)
        at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.buildProfile(BaseSparkCommitActionExecutor.java:158)
        at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.execute(BaseSparkCommitActionExecutor.java:126)
        at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.execute(BaseSparkCommitActionExecutor.java:78)
        at org.apache.hudi.table.action.commit.AbstractWriteHelper.write(AbstractWriteHelper.java:55)
        ... 39 more
   ```
   
   Note that,
   ```
   ShuffleMapStage 18 (countByKey at BaseSparkCommitActionExecutor.java:158) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: Failed to connect to /172.35.196.242:35965
   ```
   I am not sure this is a Hudi issue. A `FetchFailedException` with a connection failure typically happens due to memory pressure on the executors: the executor serving the shuffle blocks has likely been lost (e.g. OOM-killed), so its shuffle output can no longer be fetched. Consider tuning the executor memory (`spark.executor.memory`) and memory overhead (`spark.executor.memoryOverhead`) settings.
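
   As a rough sketch, these could be raised at submit time along these lines. The values below are illustrative placeholders, not recommendations, and `com.example.HudiWriter` / `my-hudi-job.jar` are hypothetical stand-ins for your application:
   ```
   # Placeholder values; size the memory settings to your workload and cluster.
   # spark.executor.memoryOverhead applies to Spark 2.3+; older YARN builds
   # use spark.yarn.executor.memoryOverhead instead.
   spark-submit \
     --executor-memory 8g \
     --conf spark.executor.memoryOverhead=2g \
     --conf spark.network.timeout=600s \
     --class com.example.HudiWriter \
     my-hudi-job.jar
   ```
   If an executor is indeed being OOM-killed, the resource manager logs (e.g. YARN NodeManager) for the lost executor usually say so, which would confirm the memory-pressure theory.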
   
   cc: @nsivabalan 

