This is an automated email from the ASF dual-hosted git repository.

chengpan pushed a commit to branch branch-0.3
in repository https://gitbox.apache.org/repos/asf/incubator-celeborn.git

commit fd5ecc1e31474adef9d4e18a01bf42d1f1899536
Author: sychen <[email protected]>
AuthorDate: Thu Sep 28 19:02:59 2023 +0800

    [CELEBORN-1013] Shutdown master if initialized failed

    ### What changes were proposed in this pull request?

    ```java
    23/09/28 14:48:12,512 ERROR [main] Master: Initialize master failed.
    java.net.BindException: Address already in use
            at sun.nio.ch.Net.bind0(Native Method)
            at sun.nio.ch.Net.bind(Net.java:461)
            at sun.nio.ch.Net.bind(Net.java:453)
            at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:222)
            at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:141)
            at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:562)
            at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
    ```

    ### Why are the changes needed?

    For example, when the HTTP service port (`celeborn.metrics.master.prometheus.port`) is already occupied, the bind fails and master startup fails. However, because the thread started by Raft is not a daemon thread, the master process keeps running.

    https://github.com/apache/ratis/blob/d461a01a53e7e130f0ec4143e75b316012137b62/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L283-L290

    ### Does this PR introduce _any_ user-facing change?

    ### How was this patch tested?

    Closes #1945 from cxzl25/CELEBORN-1013.

    Authored-by: sychen <[email protected]>
    Signed-off-by: Cheng Pan <[email protected]>

    (cherry picked from commit 16bf2aeeaa1767db5f6b676818ce4c9062ed2608)
    Signed-off-by: Cheng Pan <[email protected]>
---
 .../org/apache/celeborn/service/deploy/master/Master.scala     | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/master/src/main/scala/org/apache/celeborn/service/deploy/master/Master.scala b/master/src/main/scala/org/apache/celeborn/service/deploy/master/Master.scala
index 4344461c5..737a8a9b5 100644
--- a/master/src/main/scala/org/apache/celeborn/service/deploy/master/Master.scala
+++ b/master/src/main/scala/org/apache/celeborn/service/deploy/master/Master.scala
@@ -973,7 +973,13 @@ private[deploy] object Master extends Logging {
   def main(args: Array[String]): Unit = {
     val conf = new CelebornConf()
     val masterArgs = new MasterArguments(args, conf)
-    val master = new Master(conf, masterArgs)
-    master.initialize()
+    try {
+      val master = new Master(conf, masterArgs)
+      master.initialize()
+    } catch {
+      case e: Throwable =>
+        logError("Initialize master failed.", e)
+        System.exit(-1)
+    }
   }
 }
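
The behavior described in the commit message can be reproduced in isolation. The following is a minimal sketch, not Celeborn or Ratis code: the class `MasterExitSketch`, its methods, and the thread name are hypothetical, and the failing initializer only simulates the `BindException` from the log above. It illustrates that a live non-daemon thread keeps the JVM running after `main` returns, and that catching the failure and calling `System.exit` ends the process anyway.

```java
import java.net.BindException;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;

public class MasterExitSketch {

    /** Hypothetical stand-in for a background thread such as the one started by Raft. */
    public static Thread startWorker(boolean daemon, CountDownLatch stop) {
        Thread worker = new Thread(() -> {
            try {
                stop.await(); // park until the latch is released
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "raft-like-worker");
        worker.setDaemon(daemon); // must be set before start()
        worker.start();
        return worker;
    }

    /** A live non-daemon thread prevents the JVM from exiting when main returns. */
    public static boolean blocksJvmExit(Thread t) {
        return t.isAlive() && !t.isDaemon();
    }

    /**
     * Mirrors the try/catch shape of the patched main, but returns the exit
     * code instead of calling System.exit so that the result can be inspected.
     */
    public static int initializeOrExitCode(Callable<Void> init) {
        try {
            init.call();
            return 0;
        } catch (Throwable e) {
            System.err.println("Initialize master failed. " + e);
            return -1;
        }
    }

    public static void main(String[] args) {
        CountDownLatch stop = new CountDownLatch(1);
        Thread worker = startWorker(false, stop);

        // Simulated startup failure (assumption: stands in for an occupied port).
        int code = initializeOrExitCode(() -> {
            throw new BindException("Address already in use");
        });

        System.out.println("worker blocks JVM exit: " + blocksJvmExit(worker));
        if (code != 0) {
            // Without this call the JVM would keep running, because the
            // non-daemon worker is still parked on the latch.
            System.exit(code);
        }
        stop.countDown(); // success path: release the worker
    }
}
```

Marking the background thread as a daemon would also let the JVM exit, but that thread belongs to a dependency in the real code, so an explicit exit on initialization failure is the less invasive choice.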
