AlexLWei opened a new issue, #27709: URL: https://github.com/apache/doris/issues/27709
### Search before asking

- [X] I had searched in the [issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no similar issues.

### Version

Upgrading from 1.2.7 to 2.0.2. The BE upgrade is complete; the upgrade is currently stuck at the FE.

### What's Wrong?

The FE upgrade procedure is:

1. Shut down all FE/BE nodes.
2. Copy the doris-meta metadata to the new-version FE node and start the new FE.
3. Use ALTER SYSTEM to drop all the other FEs, then add them back.

Dropping the FE nodes works without problems. On ADD FOLLOWER, however, the FE retries three times in the background and then logs the following.

fe.log:

```
2023-11-28 09:49:56,110 ERROR (mysql-nio-pool-0|328) [BDBJEJournal.write():180] catch an exception when writing to database. sleep and retry. journal id 155010628
com.sleepycat.je.rep.InsufficientReplicasException: (JE 18.3.12) Commit policy: SIMPLE_MAJORITY required 1 replica. But none were active with this master.
    at com.sleepycat.je.rep.impl.node.DurabilityQuorum.ensureReplicasForCommit(DurabilityQuorum.java:116) ~[je-18.3.14-doris-SNAPSHOT.jar:18.3.14-doris-SNAPSHOT]
    at com.sleepycat.je.rep.impl.RepImpl.txnBeginHook(RepImpl.java:1171) ~[je-18.3.14-doris-SNAPSHOT.jar:18.3.14-doris-SNAPSHOT]
    at com.sleepycat.je.rep.txn.MasterTxn.txnBeginHook(MasterTxn.java:195) ~[je-18.3.14-doris-SNAPSHOT.jar:18.3.14-doris-SNAPSHOT]
    at com.sleepycat.je.txn.Txn.initTxn(Txn.java:384) ~[je-18.3.14-doris-SNAPSHOT.jar:18.3.14-doris-SNAPSHOT]
    at com.sleepycat.je.txn.Txn.<init>(Txn.java:288) ~[je-18.3.14-doris-SNAPSHOT.jar:18.3.14-doris-SNAPSHOT]
    at com.sleepycat.je.txn.Txn.<init>(Txn.java:267) ~[je-18.3.14-doris-SNAPSHOT.jar:18.3.14-doris-SNAPSHOT]
    at com.sleepycat.je.rep.txn.MasterTxn.<init>(MasterTxn.java:146) ~[je-18.3.14-doris-SNAPSHOT.jar:18.3.14-doris-SNAPSHOT]
    at com.sleepycat.je.rep.txn.MasterTxn$1.create(MasterTxn.java:117) ~[je-18.3.14-doris-SNAPSHOT.jar:18.3.14-doris-SNAPSHOT]
    at com.sleepycat.je.rep.txn.MasterTxn.create(MasterTxn.java:435) ~[je-18.3.14-doris-SNAPSHOT.jar:18.3.14-doris-SNAPSHOT]
    at com.sleepycat.je.rep.impl.RepImpl.createRepUserTxn(RepImpl.java:1145) ~[je-18.3.14-doris-SNAPSHOT.jar:18.3.14-doris-SNAPSHOT]
    at com.sleepycat.je.txn.Txn.createAutoTxn(Txn.java:334) ~[je-18.3.14-doris-SNAPSHOT.jar:18.3.14-doris-SNAPSHOT]
    at com.sleepycat.je.txn.LockerFactory.getWritableLocker(LockerFactory.java:79) ~[je-18.3.14-doris-SNAPSHOT.jar:18.3.14-doris-SNAPSHOT]
    at com.sleepycat.je.txn.LockerFactory.getWritableLocker(LockerFactory.java:40) ~[je-18.3.14-doris-SNAPSHOT.jar:18.3.14-doris-SNAPSHOT]
    at com.sleepycat.je.Database.put(Database.java:1625) ~[je-18.3.14-doris-SNAPSHOT.jar:18.3.14-doris-SNAPSHOT]
    at com.sleepycat.je.Database.put(Database.java:1688) ~[je-18.3.14-doris-SNAPSHOT.jar:18.3.14-doris-SNAPSHOT]
    at org.apache.doris.journal.bdbje.BDBJEJournal.write(BDBJEJournal.java:151) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.persist.EditLog.logEdit(EditLog.java:1143) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.persist.EditLog.logAddFrontend(EditLog.java:1335) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.catalog.Env.addFrontend(Env.java:2590) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.alter.SystemHandler.process(SystemHandler.java:153) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.alter.AlterHandler.process(AlterHandler.java:185) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.alter.Alter.processAlterCluster(Alter.java:736) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.catalog.Env.alterCluster(Env.java:4681) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.qe.DdlExecutor.execute(DdlExecutor.java:207) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.qe.StmtExecutor.handleDdlStmt(StmtExecutor.java:2184) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.qe.StmtExecutor.executeByLegacy(StmtExecutor.java:749) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.qe.StmtExecutor.execute(StmtExecutor.java:451) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.qe.StmtExecutor.execute(StmtExecutor.java:422) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.qe.ConnectProcessor.handleQuery(ConnectProcessor.java:435) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.qe.ConnectProcessor.dispatch(ConnectProcessor.java:583) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.qe.ConnectProcessor.processOnce(ConnectProcessor.java:834) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.mysql.ReadListener.lambda$handleEvent$0(ReadListener.java:52) ~[doris-fe.jar:1.2-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_292]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_292]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
.....
2023-11-28 09:50:01,111 ERROR (mysql-nio-pool-0|328) [BDBJEJournal.write():203] write bdb failed. will exit. journalId: 155010628, bdb database Name: 155010580
```

fe.out:

```
[2023-11-28 09:50:01] write bdb failed. will exit. journalId: 155010628, bdb database Name: 155010580
```

The FE then crashes. In testing, ADD OBSERVER is not affected.

I have since found that running ADD FOLLOWER back in the original environment triggers the same problem, except that the log message becomes:

com.sleepycat.je.rep.InsufficientReplicasException: (JE 18.3.12) Commit policy: SIMPLE_MAJORITY required 3 replica. But none were 2 active with this master (ip1 ip2).

That cluster has 3 FE FOLLOWERs, and the two IPs listed above are the non-MASTER ones. My guess is that the master goes down immediately when the command is executed.

I suspect the cause is improper use of metadata_failure_recovery for metadata recovery during the previous upgrade (1.1.5 to 1.2.7), although normal use such as data processing is unaffected.

### What You Expected?

How should this problem be handled, and where should I start? The upgrade requires migrating the FEs to a new cluster, so this needs to be resolved urgently.

### How to Reproduce?

_No response_

### Anything Else?

_No response_

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
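For reference, the replica counts in the two `InsufficientReplicasException` messages quoted in this issue can be related to the size of the BDB JE electable group. The following is a minimal sketch, not code from Doris or JE. It assumes that `SIMPLE_MAJORITY` needs acknowledgments from a majority of electable nodes and that the master itself counts as one of them, so the number of replicas required is the majority minus one. The function names are hypothetical and exist only for illustration.

```python
def majority(electable_group_size):
    """Smallest number of nodes that forms a simple majority of the group."""
    return electable_group_size // 2 + 1


def required_replica_acks(electable_group_size):
    """Replica acks needed for a commit, assuming the master counts toward the majority."""
    return majority(electable_group_size) - 1


def group_sizes_for(required_acks):
    """Electable group sizes that would produce the given 'required N replica' value."""
    # Only sizes up to 2 * required_acks + 1 can match, so a small range is enough.
    upper = 2 * required_acks + 3
    return [n for n in range(1, upper) if required_replica_acks(n) == required_acks]


if __name__ == "__main__":
    for size in range(1, 8):
        print(f"group size {size}: required replica acks = {required_replica_acks(size)}")
    print("required 1 replica ->", group_sizes_for(1))
    print("required 3 replica ->", group_sizes_for(3))
```

If that assumption holds, "required 1 replica" corresponds to an electable group of 2 or 3 nodes, and "required 3 replica" corresponds to a group of 6 or 7 nodes. The latter is more than the 3 FOLLOWERs reported for the original cluster, so comparing the replication group membership with the FE list may be a reasonable place to start.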
