avishnus opened a new issue, #2977:
URL: https://github.com/apache/celeborn/issues/2977
### What is the bug (with logs or screenshots)?
I have deployed Celeborn on k8s. However, with Celeborn enabled the job takes
7 hours to complete; without Celeborn it takes only 2 hours.
### Celeborn worker logs
`24/12/04 18:18:29,537 ERROR [fetch-server-11-50] FetchHandler: Sending ChunkFetchSuccess operation failed, chunk StreamChunkSlice[streamId=24095417247,chunkIndex=5,offset=0,len=2147483647]
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
    at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
    at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
    at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:605)
    at io.netty.channel.DefaultFileRegion.transferTo(DefaultFileRegion.java:130)
    at org.apache.celeborn.common.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:119)
    at io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:369)
    at io.netty.channel.nio.AbstractNioByteChannel.doWriteInternal(AbstractNioByteChannel.java:238)
    at io.netty.channel.nio.AbstractNioByteChannel.doWrite0(AbstractNioByteChannel.java:212)
    at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:407)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:931)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:366)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:782)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:750)`
`24/12/05 04:48:14,935 ERROR [push-timeout-checker-1] PushDataHandler: PushData replication failed for partitionLocation: PartitionLocation[
  id-epoch:2361-0
  host-rpcPort-pushPort-fetchPort-replicatePort:10.189.190.60-35529-43703-35825-43867
  mode:PRIMARY
  peer:(host-rpcPort-pushPort-fetchPort-replicatePort:10.186.111.100-34325-35069-34817-34751)
  storage hint:StorageInfo{type=MEMORY, mountPoint='/spark-local2/data', finalResult=false, filePath=}
  mapIdBitMap:null]
org.apache.celeborn.common.exception.CelebornIOException: PUSH_DATA_TIMEOUT_REPLICA
    at org.apache.celeborn.common.network.client.TransportResponseHandler.failExpiredPushRequest(TransportResponseHandler.java:145)
    at org.apache.celeborn.common.network.client.TransportResponseHandler.lambda$new$0(TransportResponseHandler.java:113)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)`