[I] [BUG] Celeborn on K8s takes more time to complete [celeborn]

via GitHub Wed, 04 Dec 2024 21:00:49 -0800


avishnus opened a new issue, #2977:
URL: https://github.com/apache/celeborn/issues/2977


   ### What is the bug(with logs or screenshots)?
   I have deployed Celeborn on K8s. However, with Celeborn enabled the job takes 7 hours to complete; without Celeborn it takes only 2 hours.
   
   ### Celeborn worker logs
   `24/12/04 18:18:29,537 ERROR [fetch-server-11-50] FetchHandler: Sending ChunkFetchSuccess operation failed, chunk StreamChunkSlice[streamId=24095417247,chunkIndex=5,offset=0,len=2147483647]
   java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
        at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
        at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
        at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:605)
        at io.netty.channel.DefaultFileRegion.transferTo(DefaultFileRegion.java:130)
        at org.apache.celeborn.common.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:119)
        at io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:369)
        at io.netty.channel.nio.AbstractNioByteChannel.doWriteInternal(AbstractNioByteChannel.java:238)
        at io.netty.channel.nio.AbstractNioByteChannel.doWrite0(AbstractNioByteChannel.java:212)
        at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:407)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:931)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:366)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:782)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:750)`
   
   `24/12/05 04:48:14,935 ERROR [push-timeout-checker-1] PushDataHandler: PushData replication failed for partitionLocation: PartitionLocation[
     id-epoch:2361-0
     host-rpcPort-pushPort-fetchPort-replicatePort:10.189.190.60-35529-43703-35825-43867
     mode:PRIMARY
     peer:(host-rpcPort-pushPort-fetchPort-replicatePort:10.186.111.100-34325-35069-34817-34751)
     storage hint:StorageInfo{type=MEMORY, mountPoint='/spark-local2/data', finalResult=false, filePath=}
     mapIdBitMap:null]
   org.apache.celeborn.common.exception.CelebornIOException: PUSH_DATA_TIMEOUT_REPLICA
        at org.apache.celeborn.common.network.client.TransportResponseHandler.failExpiredPushRequest(TransportResponseHandler.java:145)
        at org.apache.celeborn.common.network.client.TransportResponseHandler.lambda$new$0(TransportResponseHandler.java:113)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)`


