akskg opened a new issue, #52760: URL: https://github.com/apache/doris/issues/52760
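### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no similar issues.

### Version

3.0.5/3.0.6

### What's Wrong?

### Test scenario:

Data is written to the cluster continuously through stream_load with group commit. The table has enable_single_replica_compaction=true and is configured with 2 replicas.

### How the problem occurs:

A BE node restarts (either a normal or an abnormal restart).

### Symptoms:

1. The number of transactions in the system reaches 10000 (the upper limit). All of these transactions carry the label group_commit. Even with the client write program kept shut down, transactions of this kind keep being generated. They stay in the PREPARE state and are later ABORTed automatically.
2. Client stream load writes fail in large numbers with the error can not get a block queue for table_id:xxx
3. For some tables, one of a tablet's two replicas reaches a version count of 4000 (the upper limit).

BE background log: `W20250701 15:06:52.967594 3868301 wal_table.cpp:106] failed to replay wal=/sdata1/doris/storage/wal/1749187870097/1749546354117/1_1749187869558_131370564_group_commit_f246bc879e0bc12b_589e13b2752925ab, st=[INTERNAL_ERROR][INTERNAL_ERROR][INTERNAL_ERROR]tablet error: [E-235]failed to init rowset builder. version count: 4003, exceed limit: 4000, tablet: 1749650199440. Please reduce the frequency of loading data or adjust the max_tablet_version_num in be.conf to a larger value., host: [xxxxx](http://xxxxx/), host: [xxxxx](http://xxxxx/)`

Looking at the two replicas of this tablet: one replica has a small version count, while the other has reached 4000.

The tablet's compaction status was checked through the API.

### Suspected cause

My guess is that the problem comes from enable_single_replica_compaction=true. While a node is down or restarting, some tablets complete compaction on another node. After the failed node recovers, the compacted files for those tablets are not synchronized to it. Because data keeps being written, the version count of the tablet replica on the recovered node keeps rising, which triggers compaction again. However, since the table has enable_single_replica_compaction enabled, that compaction is terminated abnormally. This repeats in an endless loop until the transaction pool is exhausted, which leaves the cluster in an abnormal state.

### Data recovery approach

After hitting this situation, I have verified that the following steps repair it:

1. Reallocate the tablet replica with the abnormal version count: set the replica status to drop so that the FE allocates a new replica to complete the repair. `ADMIN SET REPLICA STATUS PROPERTIES ("tablet_id"="xxxx","backend_id"="xxxx","status"="drop");`
2. Change the table's enable_single_replica_compaction to false. The abnormal replica then starts compaction on its own.

### What You Expected?

Optimize the strategy for synchronizing compacted files when enable_single_replica_compaction=true:

1. When an abnormal situation occurs, a tablet replica should still be able to run compaction by itself, instead of not compacting at all as it does now.
2. Avoid the endless loop in which the version count keeps rising and compaction keeps being triggered, which ties up transactions and brings the cluster into an abnormal state.

### How to Reproduce?

_No response_

### Anything Else?

_No response_

### Are you willing to submit PR?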
- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]

For additional commands, e-mail: [email protected]
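
The first recovery step in the issue (marking the lagging replica as drop) could be sketched roughly as follows. This is only an illustration, not part of Doris: the `Replica` type, the `lagging_replicas` and `drop_statement` helpers, the 90% threshold, and the backend IDs and small version counts in the usage note are all hypothetical. Only the `ADMIN SET REPLICA STATUS` statement, the 4000 limit, and the tablet ID with version count 4003 come from the report. How the per-replica version counts are collected is left out on purpose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    """Hypothetical record of one tablet replica and its version count."""
    tablet_id: int
    backend_id: int
    version_count: int


def lagging_replicas(replicas, limit=4000, ratio=0.9):
    """Return replicas near the version limit whose sibling replica is not.

    `limit` mirrors the 4000 cap quoted in the BE log; `ratio` is an
    arbitrary assumption used to flag replicas before they hit the cap.
    """
    threshold = int(limit * ratio)
    by_tablet = {}
    for r in replicas:
        by_tablet.setdefault(r.tablet_id, []).append(r)

    flagged = []
    for group in by_tablet.values():
        lowest = min(r.version_count for r in group)
        for r in group:
            # Flag only when at least one sibling is healthy, which is the
            # pattern described in the issue (one low replica, one at the cap).
            if r.version_count >= threshold and lowest < threshold:
                flagged.append(r)
    return flagged


def drop_statement(replica):
    """Build the ADMIN SET REPLICA STATUS statement quoted in the issue."""
    return (
        "ADMIN SET REPLICA STATUS PROPERTIES "
        f'("tablet_id"="{replica.tablet_id}",'
        f'"backend_id"="{replica.backend_id}","status"="drop");'
    )
```

For example, given `Replica(1749650199440, 10001, 4003)` and `Replica(1749650199440, 10002, 12)`, only the first replica is flagged, and `drop_statement` yields the statement to run against the FE. Generating the statement rather than executing it keeps a human in the loop before a replica is dropped.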
