Re: [I] failed to commit txn: MaybeCommitted [doris]

via GitHub Wed, 17 Sep 2025 22:05:09 -0700


AItale-4 commented on issue #56154:
URL: https://github.com/apache/doris/issues/56154#issuecomment-3305171000


   > It is possible that concurrent INSERT OVERWRITE operations on the same 
partition by two threads caused one of them to fail. You can check the audit 
logs of all FEs around 10:12 on September 17, 2025, with the filter keyword set 
to the name of this import table. If there are too many edit logs, add the 
keyword "overwrite" for further filtering. If there are conflicting write 
operations on the same partition, then this is the root cause of the issue.
   
   I encountered the same error again. Below are the logs from each FE node. 
Could you please help me confirm whether the cause is indeed as you described, 
or if it might be due to something else? Also, since I only executed a single 
INSERT OVERWRITE operation, why would there be concurrent multi-threaded writes?
   
   ————————————FE Master————————————
   2025-09-17 21:03:37,493 INFO (Statistics Job Appender|348) 
[ConnectContext.getCloudCluster():1335] finally set context compute group name 
defaultComputeGroup for user 'admin'@'%' with chose way 'authorized cluster'
   2025-09-17 21:03:37,752 INFO (InsertOverwriteDropDirtyPartitions|348) 
[InsertOverwriteManager.runAfterCatalogReady():369] start clean insert 
overwrite temp partitions
   2025-09-17 21:03:38,011 INFO (thrift-server-pool-126|348) 
[CollectRelation.collectFromUnboundRelation():184] collect table [internal, 
ads, ads_mes_hr_attendance_attend_check_zh_df] from INSERT_TARGET
   2025-09-17 21:03:38,011 INFO (thrift-server-pool-126|348) 
[CollectRelation.collectFromUnboundRelation():184] collect table [hive, ads, 
ads_mes_hr_attendance_attend_check_df] from QUERY
   2025-09-17 21:03:38,013 INFO (NotCheckpointRowCountRefreshExecutor-49|348) 
[HMSExternalTable.fetchRowCount():617] Will estimate row count for table 
ads_mes_hr_attendance_attend_check_df from file list.
   2025-09-17 21:03:38,013 INFO (NotCheckpointRowCountRefreshExecutor-49|348) 
[HMSExternalTable.getAllPartitionValues():985] Partition values size for hive 
table ads_mes_hr_attendance_attend_check_df is 32
   2025-09-17 21:03:38,033 INFO (NotCheckpointRowCountRefreshExecutor-49|348) 
[HMSExternalTable.getFilesForPartitions():1023] Partition list size for hive 
partition table ads_mes_hr_attendance_attend_check_df is 30
   2025-09-17 21:03:38,036 INFO (NotCheckpointRowCountRefreshExecutor-49|348) 
[HMSExternalTable.getRowCountFromFileList():932] Number of files selected for 
hive table ads_mes_hr_attendance_attend_check_df is 30
   2025-09-17 21:03:38,036 INFO (NotCheckpointRowCountRefreshExecutor-49|348) 
[HMSExternalTable.getRowCountFromFileList():957] Table 
ads_mes_hr_attendance_attend_check_df sampled 30 of 32 partitions, sampled size 
is 334086687
   2025-09-17 21:03:38,036 INFO (NotCheckpointRowCountRefreshExecutor-49|348) 
[HMSExternalTable.getRowCountFromFileList():962] Table 
ads_mes_hr_attendance_attend_check_df rows 1713265, total size is 356359132, 
estimatedRowSize is 208
   2025-09-17 21:03:38,194 INFO (thrift-server-pool-126|348) 
[PropertyAnalyzer.analyzeDataProperty():380] Can not assign cool down timestamp 
to HDD storage medium, ignore user setting.
   2025-09-17 21:03:38,493 INFO (Statistics Job Appender|348) 
[ConnectContext.getCloudCluster():1335] finally set context compute group name 
defaultComputeGroup for user 'admin'@'%' with chose way 'authorized cluster'
   2025-09-17 21:03:39,497 INFO (Statistics Job Appender|348) 
[ConnectContext.getCloudCluster():1335] finally set context compute group name 
defaultComputeGroup for user 'admin'@'%' with chose way 'authorized cluster'
   2025-09-17 21:03:40,220 WARN (thrift-server-pool-126|348) 
[CloudInternalCatalog.preparePartition():479] preparePartition response: status 
{
   code: KV_TXN_COMMIT_ERR
   msg: "failed to commit txn: MaybeCommitted"
   }
   
   2025-09-17 21:03:40,222 WARN (thrift-server-pool-126|348) 
[InsertOverwriteTableCommand.run():222] insert into overwrite failed with 
task(or group) id 1757751988211
   2025-09-17 21:03:40,222 INFO (thrift-server-pool-126|348) 
[InsertOverwriteManager.taskFail():214] insert overwrite task [1757751988211] 
failed
   2025-09-17 21:03:40,222 INFO (thrift-server-pool-126|348) 
[InsertOverwriteManager.removeTask():260] remove insert overwrite task: 
InsertOverwriteTask{cancel=false, dbId=1754661298419, tableId=1754661299395, 
tempPartitionNames=[iot_temp_17647_ads_mes_hr_atte63370384_1758114218014]}
   2025-09-17 21:03:40,246 WARN (thrift-server-pool-126|348) 
[StmtExecutor.execute():617] Analyze failed. stmt[214338, 
37a560c4fd734381-bd494f63a24b8bee]
   org.apache.doris.common.NereidsException: errCode = 2, detailMessage = 
Failed to ADD PARTITION iot_temp_17647_ads_mes_hr_atte63370384_1758114218014 
LIKE ads_mes_hr_attendance_attend_check_zh_df. Reason: errCode = 2, 
detailMessage = failed to commit txn: MaybeCommitted
   at org.apache.doris.qe.StmtExecutor.executeByNereids(StmtExecutor.java:781) 
~[doris-fe.jar:1.2-SNAPSHOT]
   at org.apache.doris.qe.StmtExecutor.execute(StmtExecutor.java:602) 
~[doris-fe.jar:1.2-SNAPSHOT]
   at org.apache.doris.qe.StmtExecutor.queryRetry(StmtExecutor.java:564) 
~[doris-fe.jar:1.2-SNAPSHOT]
   at 
org.apache.doris.qe.ConnectProcessor.proxyExecute(ConnectProcessor.java:751) 
~[doris-fe.jar:1.2-SNAPSHOT]
   at 
org.apache.doris.service.FrontendServiceImpl.forward(FrontendServiceImpl.java:1161)
 ~[doris-fe.jar:1.2-SNAPSHOT]
   at jdk.internal.reflect.GeneratedMethodAccessor119.invoke(Unknown Source) 
~[?:?]
   at 
jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 ~[?:?]
   at java.lang.reflect.Method.invoke(Method.java:568) ~[?:?]
   at org.apache.doris.service.FeServer.lambda$start$0(FeServer.java:60) 
~[doris-fe.jar:1.2-SNAPSHOT]
   at jdk.proxy2.$Proxy46.forward(Unknown Source) ~[?:?]
   at 
org.apache.doris.thrift.FrontendService$Processor$forward.getResult(FrontendService.java:4092)
 ~[fe-common-1.2-SNAPSHOT.jar:1.2-SNAPSHOT]
   at 
org.apache.doris.thrift.FrontendService$Processor$forward.getResult(FrontendService.java:4072)
 ~[fe-common-1.2-SNAPSHOT.jar:1.2-SNAPSHOT]
   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38) 
~[libthrift-0.16.0.jar:0.16.0]
   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38) 
~[libthrift-0.16.0.jar:0.16.0]
   at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:250)
 ~[libthrift-0.16.0.jar:0.16.0]
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) 
~[?:?]
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) 
~[?:?]
   at java.lang.Thread.run(Thread.java:833) ~[?:?]
   Caused by: org.apache.doris.common.AnalysisException: errCode = 2, 
detailMessage = Failed to ADD PARTITION 
iot_temp_17647_ads_mes_hr_atte63370384_1758114218014 LIKE 
ads_mes_hr_attendance_attend_check_zh_df. Reason: errCode = 2, detailMessage = 
failed to commit txn: MaybeCommitted
   
   ————————————FE Follower1————————————
   2025-09-18 09:03:29,822 INFO (replayer|144) [Env.replayJournal():2966] 
replayed journal id is 1025360, replay to journal id is 1025361
   2025-09-18 09:03:30,315 INFO (replayer|144) [Env.replayJournal():2966] 
replayed journal id is 1025361, replay to journal id is 1025362
   2025-09-18 09:03:30,377 INFO (replayer|144) [Env.replayJournal():2966] 
replayed journal id is 1025362, replay to journal id is 1025363
   2025-09-18 09:03:30,378 INFO (replayer|144) [EditLog.loadJournal():254] 
Begin to unprotect add partition. db = 1754661298419 table = 1754661299395 
partitionName = iot_temp_18943_ads_mes_hr_atte1039518318_1758157410311
   2025-09-18 09:03:30,569 INFO (replayer|144) [Env.replayJournal():2966] 
replayed journal id is 1025363, replay to journal id is 1025364
   2025-09-18 09:03:30,571 INFO (replayer|144) 
[AutoIncrementGenerator.applyChange():73] [auto-inc] AutoIncrementGenerator 
applyChange, db_id=1754661298419, table_id=1754661299395, column_id=0, 
batchNextId=1068000000
   2025-09-18 09:03:30,683 INFO (replayer|144) [Env.replayJournal():2966] 
replayed journal id is 1025364, replay to journal id is 1025365
   2025-09-18 09:03:30,684 INFO (replayer|144) 
[AutoIncrementGenerator.applyChange():73] [auto-inc] AutoIncrementGenerator 
applyChange, db_id=1754661298419, table_id=1754661299395, column_id=0, 
batchNextId=1068500000
   2025-09-18 09:03:31,587 INFO (replayer|144) [Env.replayJournal():2966] 
replayed journal id is 1025365, replay to journal id is 1025366
   2025-09-18 09:03:31,589 INFO (replayer|144) [Env.replayJournal():2966] 
replayed journal id is 1025366, replay to journal id is 1025367
   2025-09-18 09:03:31,591 INFO (replayer|144) [Env.replayJournal():2966] 
replayed journal id is 1025367, replay to journal id is 1025368
   2025-09-18 09:03:31,592 INFO (replayer|144) 
[LoadManager.replayCreateLoadJob():186] LOAD_JOB=1757752019419, msg={replay 
create load job}
   2025-09-18 09:03:31,594 INFO (replayer|144) [Env.replayJournal():2966] 
replayed journal id is 1025368, replay to journal id is 1025369
   2025-09-18 09:03:31,594 INFO (replayer|144) 
[CatalogRecycleBin.recyclePartition():203] recycle 
partition[1757752016737-ads_mes_hr_attendance_attend_check_zh_df] of table 
[1754661299395-ads_mes_hr_attendance_attend_check_zh_df]
   2025-09-18 09:03:31,594 INFO (replayer|144) 
[OlapTable.renamePartition():1027] rename partition 
ads_mes_hr_attendance_attend_check_zh_df in table 
ads_mes_hr_attendance_attend_check_zh_df
   2025-09-18 09:03:31,596 INFO (replayer|144) [Env.replayJournal():2966] 
replayed journal id is 1025369, replay to journal id is 1025370
   2025-09-18 09:03:39,651 INFO (replayer|144) [Env.replayJournal():2966] 
replayed journal id is 1025370, replay to journal id is 1025371
   2025-09-18 09:03:39,836 INFO (replayer|144) [Env.replayJournal():2966] 
replayed journal id is 1025371, replay to journal id is 1025372
   2025-09-18 09:03:45,002 INFO (replayer|144) [Env.replayJournal():2966] 
replayed journal id is 1025372, replay to journal id is 1025373
   2025-09-18 09:03:45,818 INFO (replayer|144) [Env.replayJournal():2966] 
replayed journal id is 1025373, replay to journal id is 1025374
   2025-09-18 09:03:45,818 INFO (replayer|144) 
[CatalogRecycleBin.replayErasePartition():592] replay erase 
partition[1757752008935]
   
   ————————————FE Follower2————————————
   2025-09-18 09:03:29,820 INFO (replayer|148) [Env.replayJournal():2966] 
replayed journal id is 1025360, replay to journal id is 1025361
   2025-09-18 09:03:30,280 INFO (mysql-nio-pool-479|148) 
[ConnectContext.getCloudCluster():1335] finally set context compute group name 
defaultComputeGroup for user 'root'@'%' with chose way 'authorized cluster'
   2025-09-18 09:03:30,282 INFO (mysql-nio-pool-479|148) 
[Coordinator.execInternal():756] dispatch result sink of query 
39c1f14499a343a3-b2cea646d223395e to TNetworkAddress(hostname:192.168.1.101, 
port:9060)
   2025-09-18 09:03:30,287 WARN (mysql-nio-pool-479|148) 
[ResultReceiver.cancel():223] ResultReceiver of query 
39c1f14499a343a3-b2cea646d223395e cancel failed, typically means the future is 
finished, cancel reason is Status [errorCode=LIMIT_REACH, errorMsg=query reach 
limit]
   2025-09-18 09:03:30,287 INFO (mysql-nio-pool-479|148) 
[StmtExecutor.handleQueryStmt():1919] Query 39c1f14499a343a3-b2cea646d223395e 
finished
   2025-09-18 09:03:30,289 INFO (mysql-nio-pool-479|148) 
[MasterOpExecutor.forward():164] forward to master FE 
TNetworkAddress(hostname:192.168.1.103, port:9020), statement id: 12218
   2025-09-18 09:03:30,313 INFO (replayer|148) [Env.replayJournal():2966] 
replayed journal id is 1025361, replay to journal id is 1025362
   2025-09-18 09:03:30,374 INFO (replayer|148) [Env.replayJournal():2966] 
replayed journal id is 1025362, replay to journal id is 1025363
   2025-09-18 09:03:30,378 INFO (replayer|148) [EditLog.loadJournal():254] 
Begin to unprotect add partition. db = 1754661298419 table = 1754661299395 
partitionName = iot_temp_18943_ads_mes_hr_atte1039518318_1758157410311
   2025-09-18 09:03:30,565 INFO (replayer|148) [Env.replayJournal():2966] 
replayed journal id is 1025363, replay to journal id is 1025364
   2025-09-18 09:03:30,565 INFO (replayer|148) 
[AutoIncrementGenerator.applyChange():73] [auto-inc] AutoIncrementGenerator 
applyChange, db_id=1754661298419, table_id=1754661299395, column_id=0, 
batchNextId=1068000000
   2025-09-18 09:03:30,681 INFO (replayer|148) [Env.replayJournal():2966] 
replayed journal id is 1025364, replay to journal id is 1025365
   2025-09-18 09:03:30,682 INFO (replayer|148) 
[AutoIncrementGenerator.applyChange():73] [auto-inc] AutoIncrementGenerator 
applyChange, db_id=1754661298419, table_id=1754661299395, column_id=0, 
batchNextId=1068500000
   2025-09-18 09:03:31,585 INFO (replayer|148) [Env.replayJournal():2966] 
replayed journal id is 1025365, replay to journal id is 1025366
   2025-09-18 09:03:31,588 INFO (replayer|148) [Env.replayJournal():2966] 
replayed journal id is 1025366, replay to journal id is 1025367
   2025-09-18 09:03:31,590 INFO (replayer|148) [Env.replayJournal():2966] 
replayed journal id is 1025367, replay to journal id is 1025368
   2025-09-18 09:03:31,590 INFO (replayer|148) 
[LoadManager.replayCreateLoadJob():186] LOAD_JOB=1757752019419, msg={replay 
create load job}
   2025-09-18 09:03:31,593 INFO (replayer|148) [Env.replayJournal():2966] 
replayed journal id is 1025368, replay to journal id is 1025370
   2025-09-18 09:03:31,593 INFO (replayer|148) 
[CatalogRecycleBin.recyclePartition():203] recycle 
partition[1757752016737-ads_mes_hr_attendance_attend_check_zh_df] of table 
[1754661299395-ads_mes_hr_attendance_attend_check_zh_df]
   2025-09-18 09:03:31,593 INFO (replayer|148) 
[OlapTable.renamePartition():1027] rename partition 
ads_mes_hr_attendance_attend_check_zh_df in table 
ads_mes_hr_attendance_attend_check_zh_df
   2025-09-18 09:03:31,594 INFO (mysql-nio-pool-479|148) 
[MasterOpExecutor.waitOnReplaying():138] forwarding to master get result max 
journal id: 1025370
   2025-09-18 09:03:31,594 INFO (mysql-nio-pool-479|148) 
[JournalObservable.waitOn():54] waiting for the observer to replay journal to 
1025370 with timeout: 17280000 ms
   2025-09-18 09:03:39,649 INFO (replayer|148) [Env.replayJournal():2966] 
replayed journal id is 1025370, replay to journal id is 1025371
   2025-09-18 09:03:39,825 INFO (replayer|148) [Env.replayJournal():2966] 
replayed journal id is 1025371, replay to journal id is 1025372
   2025-09-18 09:03:45,000 INFO (replayer|148) [Env.replayJournal():2966] 
replayed journal id is 1025372, replay to journal id is 1025373
   2025-09-18 09:03:45,816 INFO (replayer|148) [Env.replayJournal():2966] 
replayed journal id is 1025373, replay to journal id is 1025374
   2025-09-18 09:03:45,816 INFO (replayer|148) 
[CatalogRecycleBin.replayErasePartition():592] replay erase 
partition[1757752008935]


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] failed to commit txn: MaybeCommitted [doris]

Reply via email to