AItale-4 commented on issue #56154:
URL: https://github.com/apache/doris/issues/56154#issuecomment-3305171000
> It is possible that concurrent INSERT OVERWRITE operations on the same
partition by two threads caused one of them to fail. You can check the audit
logs of all FEs around 10:12 on September 17, 2025, with the filter keyword set
to the name of this import table. If there are too many edit logs, add the
keyword "overwrite" for further filtering. If there are conflicting write
operations on the same partition, then this is the root cause of the issue.
I encountered the same error again. Below are the logs from each FE node.
Could you please help me confirm whether the cause is indeed as you described,
or if it might be due to something else? Also, since I only executed a single
INSERT OVERWRITE operation, why would there be concurrent multi-threaded writes?
————————————FE Master————————————
2025-09-17 21:03:37,493 INFO (Statistics Job Appender|348)
[ConnectContext.getCloudCluster():1335] finally set context compute group name
defaultComputeGroup for user 'admin'@'%' with chose way 'authorized cluster'
2025-09-17 21:03:37,752 INFO (InsertOverwriteDropDirtyPartitions|348)
[InsertOverwriteManager.runAfterCatalogReady():369] start clean insert
overwrite temp partitions
2025-09-17 21:03:38,011 INFO (thrift-server-pool-126|348)
[CollectRelation.collectFromUnboundRelation():184] collect table [internal,
ads, ads_mes_hr_attendance_attend_check_zh_df] from INSERT_TARGET
2025-09-17 21:03:38,011 INFO (thrift-server-pool-126|348)
[CollectRelation.collectFromUnboundRelation():184] collect table [hive, ads,
ads_mes_hr_attendance_attend_check_df] from QUERY
2025-09-17 21:03:38,013 INFO (NotCheckpointRowCountRefreshExecutor-49|348)
[HMSExternalTable.fetchRowCount():617] Will estimate row count for table
ads_mes_hr_attendance_attend_check_df from file list.
2025-09-17 21:03:38,013 INFO (NotCheckpointRowCountRefreshExecutor-49|348)
[HMSExternalTable.getAllPartitionValues():985] Partition values size for hive
table ads_mes_hr_attendance_attend_check_df is 32
2025-09-17 21:03:38,033 INFO (NotCheckpointRowCountRefreshExecutor-49|348)
[HMSExternalTable.getFilesForPartitions():1023] Partition list size for hive
partition table ads_mes_hr_attendance_attend_check_df is 30
2025-09-17 21:03:38,036 INFO (NotCheckpointRowCountRefreshExecutor-49|348)
[HMSExternalTable.getRowCountFromFileList():932] Number of files selected for
hive table ads_mes_hr_attendance_attend_check_df is 30
2025-09-17 21:03:38,036 INFO (NotCheckpointRowCountRefreshExecutor-49|348)
[HMSExternalTable.getRowCountFromFileList():957] Table
ads_mes_hr_attendance_attend_check_df sampled 30 of 32 partitions, sampled size
is 334086687
2025-09-17 21:03:38,036 INFO (NotCheckpointRowCountRefreshExecutor-49|348)
[HMSExternalTable.getRowCountFromFileList():962] Table
ads_mes_hr_attendance_attend_check_df rows 1713265, total size is 356359132,
estimatedRowSize is 208
2025-09-17 21:03:38,194 INFO (thrift-server-pool-126|348)
[PropertyAnalyzer.analyzeDataProperty():380] Can not assign cool down timestamp
to HDD storage medium, ignore user setting.
2025-09-17 21:03:38,493 INFO (Statistics Job Appender|348)
[ConnectContext.getCloudCluster():1335] finally set context compute group name
defaultComputeGroup for user 'admin'@'%' with chose way 'authorized cluster'
2025-09-17 21:03:39,497 INFO (Statistics Job Appender|348)
[ConnectContext.getCloudCluster():1335] finally set context compute group name
defaultComputeGroup for user 'admin'@'%' with chose way 'authorized cluster'
2025-09-17 21:03:40,220 WARN (thrift-server-pool-126|348)
[CloudInternalCatalog.preparePartition():479] preparePartition response: status
{
code: KV_TXN_COMMIT_ERR
msg: "failed to commit txn: MaybeCommitted"
}
2025-09-17 21:03:40,222 WARN (thrift-server-pool-126|348)
[InsertOverwriteTableCommand.run():222] insert into overwrite failed with
task(or group) id 1757751988211
2025-09-17 21:03:40,222 INFO (thrift-server-pool-126|348)
[InsertOverwriteManager.taskFail():214] insert overwrite task [1757751988211]
failed
2025-09-17 21:03:40,222 INFO (thrift-server-pool-126|348)
[InsertOverwriteManager.removeTask():260] remove insert overwrite task:
InsertOverwriteTask{cancel=false, dbId=1754661298419, tableId=1754661299395,
tempPartitionNames=[iot_temp_17647_ads_mes_hr_atte63370384_1758114218014]}
2025-09-17 21:03:40,246 WARN (thrift-server-pool-126|348)
[StmtExecutor.execute():617] Analyze failed. stmt[214338,
37a560c4fd734381-bd494f63a24b8bee]
org.apache.doris.common.NereidsException: errCode = 2, detailMessage =
Failed to ADD PARTITION iot_temp_17647_ads_mes_hr_atte63370384_1758114218014
LIKE ads_mes_hr_attendance_attend_check_zh_df. Reason: errCode = 2,
detailMessage = failed to commit txn: MaybeCommitted
at org.apache.doris.qe.StmtExecutor.executeByNereids(StmtExecutor.java:781)
~[doris-fe.jar:1.2-SNAPSHOT]
at org.apache.doris.qe.StmtExecutor.execute(StmtExecutor.java:602)
~[doris-fe.jar:1.2-SNAPSHOT]
at org.apache.doris.qe.StmtExecutor.queryRetry(StmtExecutor.java:564)
~[doris-fe.jar:1.2-SNAPSHOT]
at
org.apache.doris.qe.ConnectProcessor.proxyExecute(ConnectProcessor.java:751)
~[doris-fe.jar:1.2-SNAPSHOT]
at
org.apache.doris.service.FrontendServiceImpl.forward(FrontendServiceImpl.java:1161)
~[doris-fe.jar:1.2-SNAPSHOT]
at jdk.internal.reflect.GeneratedMethodAccessor119.invoke(Unknown Source)
~[?:?]
at
jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
~[?:?]
at java.lang.reflect.Method.invoke(Method.java:568) ~[?:?]
at org.apache.doris.service.FeServer.lambda$start$0(FeServer.java:60)
~[doris-fe.jar:1.2-SNAPSHOT]
at jdk.proxy2.$Proxy46.forward(Unknown Source) ~[?:?]
at
org.apache.doris.thrift.FrontendService$Processor$forward.getResult(FrontendService.java:4092)
~[fe-common-1.2-SNAPSHOT.jar:1.2-SNAPSHOT]
at
org.apache.doris.thrift.FrontendService$Processor$forward.getResult(FrontendService.java:4072)
~[fe-common-1.2-SNAPSHOT.jar:1.2-SNAPSHOT]
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
~[libthrift-0.16.0.jar:0.16.0]
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
~[libthrift-0.16.0.jar:0.16.0]
at
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:250)
~[libthrift-0.16.0.jar:0.16.0]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
~[?:?]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
~[?:?]
at java.lang.Thread.run(Thread.java:833) ~[?:?]
Caused by: org.apache.doris.common.AnalysisException: errCode = 2,
detailMessage = Failed to ADD PARTITION
iot_temp_17647_ads_mes_hr_atte63370384_1758114218014 LIKE
ads_mes_hr_attendance_attend_check_zh_df. Reason: errCode = 2, detailMessage =
failed to commit txn: MaybeCommitted
————————————FE Follower1————————————
2025-09-18 09:03:29,822 INFO (replayer|144) [Env.replayJournal():2966]
replayed journal id is 1025360, replay to journal id is 1025361
2025-09-18 09:03:30,315 INFO (replayer|144) [Env.replayJournal():2966]
replayed journal id is 1025361, replay to journal id is 1025362
2025-09-18 09:03:30,377 INFO (replayer|144) [Env.replayJournal():2966]
replayed journal id is 1025362, replay to journal id is 1025363
2025-09-18 09:03:30,378 INFO (replayer|144) [EditLog.loadJournal():254]
Begin to unprotect add partition. db = 1754661298419 table = 1754661299395
partitionName = iot_temp_18943_ads_mes_hr_atte1039518318_1758157410311
2025-09-18 09:03:30,569 INFO (replayer|144) [Env.replayJournal():2966]
replayed journal id is 1025363, replay to journal id is 1025364
2025-09-18 09:03:30,571 INFO (replayer|144)
[AutoIncrementGenerator.applyChange():73] [auto-inc] AutoIncrementGenerator
applyChange, db_id=1754661298419, table_id=1754661299395, column_id=0,
batchNextId=1068000000
2025-09-18 09:03:30,683 INFO (replayer|144) [Env.replayJournal():2966]
replayed journal id is 1025364, replay to journal id is 1025365
2025-09-18 09:03:30,684 INFO (replayer|144)
[AutoIncrementGenerator.applyChange():73] [auto-inc] AutoIncrementGenerator
applyChange, db_id=1754661298419, table_id=1754661299395, column_id=0,
batchNextId=1068500000
2025-09-18 09:03:31,587 INFO (replayer|144) [Env.replayJournal():2966]
replayed journal id is 1025365, replay to journal id is 1025366
2025-09-18 09:03:31,589 INFO (replayer|144) [Env.replayJournal():2966]
replayed journal id is 1025366, replay to journal id is 1025367
2025-09-18 09:03:31,591 INFO (replayer|144) [Env.replayJournal():2966]
replayed journal id is 1025367, replay to journal id is 1025368
2025-09-18 09:03:31,592 INFO (replayer|144)
[LoadManager.replayCreateLoadJob():186] LOAD_JOB=1757752019419, msg={replay
create load job}
2025-09-18 09:03:31,594 INFO (replayer|144) [Env.replayJournal():2966]
replayed journal id is 1025368, replay to journal id is 1025369
2025-09-18 09:03:31,594 INFO (replayer|144)
[CatalogRecycleBin.recyclePartition():203] recycle
partition[1757752016737-ads_mes_hr_attendance_attend_check_zh_df] of table
[1754661299395-ads_mes_hr_attendance_attend_check_zh_df]
2025-09-18 09:03:31,594 INFO (replayer|144)
[OlapTable.renamePartition():1027] rename partition
ads_mes_hr_attendance_attend_check_zh_df in table
ads_mes_hr_attendance_attend_check_zh_df
2025-09-18 09:03:31,596 INFO (replayer|144) [Env.replayJournal():2966]
replayed journal id is 1025369, replay to journal id is 1025370
2025-09-18 09:03:39,651 INFO (replayer|144) [Env.replayJournal():2966]
replayed journal id is 1025370, replay to journal id is 1025371
2025-09-18 09:03:39,836 INFO (replayer|144) [Env.replayJournal():2966]
replayed journal id is 1025371, replay to journal id is 1025372
2025-09-18 09:03:45,002 INFO (replayer|144) [Env.replayJournal():2966]
replayed journal id is 1025372, replay to journal id is 1025373
2025-09-18 09:03:45,818 INFO (replayer|144) [Env.replayJournal():2966]
replayed journal id is 1025373, replay to journal id is 1025374
2025-09-18 09:03:45,818 INFO (replayer|144)
[CatalogRecycleBin.replayErasePartition():592] replay erase
partition[1757752008935]
————————————FE Follower2————————————
2025-09-18 09:03:29,820 INFO (replayer|148) [Env.replayJournal():2966]
replayed journal id is 1025360, replay to journal id is 1025361
2025-09-18 09:03:30,280 INFO (mysql-nio-pool-479|148)
[ConnectContext.getCloudCluster():1335] finally set context compute group name
defaultComputeGroup for user 'root'@'%' with chose way 'authorized cluster'
2025-09-18 09:03:30,282 INFO (mysql-nio-pool-479|148)
[Coordinator.execInternal():756] dispatch result sink of query
39c1f14499a343a3-b2cea646d223395e to TNetworkAddress(hostname:192.168.1.101,
port:9060)
2025-09-18 09:03:30,287 WARN (mysql-nio-pool-479|148)
[ResultReceiver.cancel():223] ResultReceiver of query
39c1f14499a343a3-b2cea646d223395e cancel failed, typically means the future is
finished, cancel reason is Status [errorCode=LIMIT_REACH, errorMsg=query reach
limit]
2025-09-18 09:03:30,287 INFO (mysql-nio-pool-479|148)
[StmtExecutor.handleQueryStmt():1919] Query 39c1f14499a343a3-b2cea646d223395e
finished
2025-09-18 09:03:30,289 INFO (mysql-nio-pool-479|148)
[MasterOpExecutor.forward():164] forward to master FE
TNetworkAddress(hostname:192.168.1.103, port:9020), statement id: 12218
2025-09-18 09:03:30,313 INFO (replayer|148) [Env.replayJournal():2966]
replayed journal id is 1025361, replay to journal id is 1025362
2025-09-18 09:03:30,374 INFO (replayer|148) [Env.replayJournal():2966]
replayed journal id is 1025362, replay to journal id is 1025363
2025-09-18 09:03:30,378 INFO (replayer|148) [EditLog.loadJournal():254]
Begin to unprotect add partition. db = 1754661298419 table = 1754661299395
partitionName = iot_temp_18943_ads_mes_hr_atte1039518318_1758157410311
2025-09-18 09:03:30,565 INFO (replayer|148) [Env.replayJournal():2966]
replayed journal id is 1025363, replay to journal id is 1025364
2025-09-18 09:03:30,565 INFO (replayer|148)
[AutoIncrementGenerator.applyChange():73] [auto-inc] AutoIncrementGenerator
applyChange, db_id=1754661298419, table_id=1754661299395, column_id=0,
batchNextId=1068000000
2025-09-18 09:03:30,681 INFO (replayer|148) [Env.replayJournal():2966]
replayed journal id is 1025364, replay to journal id is 1025365
2025-09-18 09:03:30,682 INFO (replayer|148)
[AutoIncrementGenerator.applyChange():73] [auto-inc] AutoIncrementGenerator
applyChange, db_id=1754661298419, table_id=1754661299395, column_id=0,
batchNextId=1068500000
2025-09-18 09:03:31,585 INFO (replayer|148) [Env.replayJournal():2966]
replayed journal id is 1025365, replay to journal id is 1025366
2025-09-18 09:03:31,588 INFO (replayer|148) [Env.replayJournal():2966]
replayed journal id is 1025366, replay to journal id is 1025367
2025-09-18 09:03:31,590 INFO (replayer|148) [Env.replayJournal():2966]
replayed journal id is 1025367, replay to journal id is 1025368
2025-09-18 09:03:31,590 INFO (replayer|148)
[LoadManager.replayCreateLoadJob():186] LOAD_JOB=1757752019419, msg={replay
create load job}
2025-09-18 09:03:31,593 INFO (replayer|148) [Env.replayJournal():2966]
replayed journal id is 1025368, replay to journal id is 1025370
2025-09-18 09:03:31,593 INFO (replayer|148)
[CatalogRecycleBin.recyclePartition():203] recycle
partition[1757752016737-ads_mes_hr_attendance_attend_check_zh_df] of table
[1754661299395-ads_mes_hr_attendance_attend_check_zh_df]
2025-09-18 09:03:31,593 INFO (replayer|148)
[OlapTable.renamePartition():1027] rename partition
ads_mes_hr_attendance_attend_check_zh_df in table
ads_mes_hr_attendance_attend_check_zh_df
2025-09-18 09:03:31,594 INFO (mysql-nio-pool-479|148)
[MasterOpExecutor.waitOnReplaying():138] forwarding to master get result max
journal id: 1025370
2025-09-18 09:03:31,594 INFO (mysql-nio-pool-479|148)
[JournalObservable.waitOn():54] waiting for the observer to replay journal to
1025370 with timeout: 17280000 ms
2025-09-18 09:03:39,649 INFO (replayer|148) [Env.replayJournal():2966]
replayed journal id is 1025370, replay to journal id is 1025371
2025-09-18 09:03:39,825 INFO (replayer|148) [Env.replayJournal():2966]
replayed journal id is 1025371, replay to journal id is 1025372
2025-09-18 09:03:45,000 INFO (replayer|148) [Env.replayJournal():2966]
replayed journal id is 1025372, replay to journal id is 1025373
2025-09-18 09:03:45,816 INFO (replayer|148) [Env.replayJournal():2966]
replayed journal id is 1025373, replay to journal id is 1025374
2025-09-18 09:03:45,816 INFO (replayer|148)
[CatalogRecycleBin.replayErasePartition():592] replay erase
partition[1757752008935]
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]