[ https://issues.apache.org/jira/browse/RATIS-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tsz-wo Sze reassigned RATIS-2323:
---------------------------------

    Assignee: gaoyajun02

> Extend ratis-shell add command
> ------------------------------
>
>                 Key: RATIS-2323
>                 URL: https://issues.apache.org/jira/browse/RATIS-2323
>             Project: Ratis
>          Issue Type: Improvement
>          Components: shell
>            Reporter: gaoyajun02
>            Assignee: gaoyajun02
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> In the Celeborn master cluster, the Leader's clientAddress and adminAddress
> are cached in Followers and used as the RPC endpoint for handling client
> requests. When a Follower receives a client request, since only the Leader
> can process client requests, the Follower returns the Leader's RPC endpoint
> to the client, allowing the client to resend the request directly to the
> Leader.
>
> When expanding the master cluster, we currently use the ratis-shell's add
> operation. However, peers added through this operation lack clientAddress and
> adminAddress settings. If a newly added peer becomes the Leader, all
> Followers will return an empty address, causing clients to access an
> incorrect Leader address (127.0.0.1).
> {code:java}
> 25/08/26 15:47:45,534 INFO [main] MasterClient: connect to master zw06-data-k8s-sparktest-node007.mt:9097.
> 25/08/26 15:47:45,669 WARN [celeborn-netty-rpc-connection-executor-1] TransportClientFactory: Retry create client, times 1/3 with error: Failed to connect to /127.0.0.1:0
> org.apache.celeborn.common.exception.CelebornIOException: Failed to connect to /127.0.0.1:0
>     at org.apache.celeborn.common.network.client.TransportClientFactory.internalCreateClient(TransportClientFactory.java:317)
>     at org.apache.celeborn.common.network.client.TransportClientFactory.createClient(TransportClientFactory.java:252)
>     at org.apache.celeborn.common.network.client.TransportClientFactory.retryCreateClient(TransportClientFactory.java:159)
>     at org.apache.celeborn.common.network.client.TransportClientFactory.createClient(TransportClientFactory.java:147)
>     at org.apache.celeborn.common.network.client.TransportClientFactory.createClient(TransportClientFactory.java:259)
>     at org.apache.celeborn.common.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:234)
>     at org.apache.celeborn.common.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
>     at org.apache.celeborn.common.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /127.0.0.1:0
> Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
>     at io.netty.channel.unix.Errors.newConnectException0(Errors.java:166)
>     at io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:131)
>     at io.netty.channel.unix.Socket.finishConnect(Socket.java:359)
>     at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:715)
>     at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:692)
>     at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567)
>     at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:491)
>     at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:399)
>     at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998)
>     at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.lang.Thread.run(Thread.java:748)
> 25/08/26 15:47:50,674 WARN [celeborn-netty-rpc-connection-executor-1] TransportClientFactory: Retry create client, times 2/3 with error: Failed to connect to /127.0.0.1:0
> org.apache.celeborn.common.exception.CelebornIOException: Failed to connect to /127.0.0.1:0
>     at org.apache.celeborn.common.network.client.TransportClientFactory.internalCreateClient(TransportClientFactory.java:317)
>     at org.apache.celeborn.common.network.client.TransportClientFactory.createClient(TransportClientFactory.java:252)
>     at org.apache.celeborn.common.network.client.TransportClientFactory.retryCreateClient(TransportClientFactory.java:159)
>     at org.apache.celeborn.common.network.client.TransportClientFactory.createClient(TransportClientFactory.java:147)
>     at org.apache.celeborn.common.network.client.TransportClientFactory.createClient(TransportClientFactory.java:259)
>     at org.apache.celeborn.common.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:234)
>     at org.apache.celeborn.common.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
>     at org.apache.celeborn.common.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /127.0.0.1:0
> Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
>     at io.netty.channel.unix.Errors.newConnectException0(Errors.java:166)
>     at io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:131)
>     at io.netty.channel.unix.Socket.finishConnect(Socket.java:359)
>     at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:715)
>     at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:692)
>     at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567)
>     at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:491)
>     at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:399)
>     at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998)
>     at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
> Therefore, we propose extending the ratis-shell add command to support
> setting clientAddress and adminAddress parameters when adding new peers to
> the cluster.
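
The failure mode described in the issue can be illustrated with a minimal sketch in plain Java. It deliberately does not use the Ratis or Celeborn APIs: `Peer`, `leaderClientEndpoint`, the peer id, and the host/port values are all hypothetical, and the only assumption taken from the issue is that a Follower answers a client with whatever clientAddress is stored for the current Leader.

```java
import java.util.Optional;

public class PeerAddressSketch {
  // Hypothetical stand-in for a peer entry in the Raft group configuration.
  // This is NOT the Ratis RaftPeer class; field names only mirror the issue text.
  static final class Peer {
    final String id;
    final String address;        // Raft server-to-server address
    final String clientAddress;  // null when the peer was added without one
    final String adminAddress;   // null when the peer was added without one

    Peer(String id, String address, String clientAddress, String adminAddress) {
      this.id = id;
      this.address = address;
      this.clientAddress = clientAddress;
      this.adminAddress = adminAddress;
    }
  }

  // What a Follower could hand back to a client as the Leader's endpoint.
  // An unset or empty clientAddress yields no usable endpoint.
  static Optional<String> leaderClientEndpoint(Peer leader) {
    return Optional.ofNullable(leader.clientAddress).filter(s -> !s.isEmpty());
  }

  public static void main(String[] args) {
    // Peer added with only the Raft address (the situation the issue describes).
    Peer withoutClientAddress = new Peer("master-4", "host4:9872", null, null);
    // Same peer if the add operation could also carry the two extra addresses.
    Peer withClientAddress = new Peer("master-4", "host4:9872", "host4:9097", "host4:9097");

    System.out.println(leaderClientEndpoint(withoutClientAddress).orElse("<unset>"));
    System.out.println(leaderClientEndpoint(withClientAddress).orElse("<unset>"));
  }
}
```

In this model the first peer gives a redirected client nothing to connect to, which corresponds to the bad endpoint in the log above; the second peer resolves to a real endpoint once the addresses are supplied at add time.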
-- This message was sent by Atlassian Jira (v8.20.10#820010)