deardeng opened a new pull request, #63069:
URL: https://github.com/apache/doris/pull/63069
master FE PublishVersionDaemon repeatedly throws NPE for a single "poisoned"
transaction at tens of times per second, indefinitely:
java.lang.NullPointerException: Cannot invoke
"java.util.Map.containsKey(Object)" because the return value of
"PublishVersionTask.getSuccTablets()" is null
at DatabaseTransactionMgr.checkReplicaContinuousVersionSucc(...)
at DatabaseTransactionMgr.finishCheckQuorumReplicas(...)
at DatabaseTransactionMgr.finishTransaction(...)
at GlobalTransactionMgr.finishTransaction(...)
at PublishVersionDaemon.tryFinishTxnSync(...)
The affected transaction never reaches VISIBLE; user-visible writes are
stuck "committed but not queryable" until publish_version_timeout_second
finally aborts them. Other concurrent transactions keep publishing normally;
only the poisoned one loops on NPE.
The check at DatabaseTransactionMgr.checkReplicaContinuousVersionSucc
success = (task != null && task.isFinished()
&& task.getSuccTablets().containsKey(tabletId)) || (...);
assumes the invariant "task.isFinished() == true => succTablets is
populated". PublishVersionTask initialises succTablets to null in its
constructor. One real and one latent code path can flip isFinished to true
while leaving succTablets at that null:
1. AgentTaskCleanupDaemon.removeInactiveBeAgentTasks (the production
trigger): when a BE has been unreachable for MAX_FAILURE_TIMES (=3) heartbeat
periods, every queued agent task on that BE -- including in-flight
PublishVersionTasks -- is force-finished via setFinished(true) +
setErrorCode(ABORTED). The daemon never calls setSuccTablets, so succTablets
stays at the constructor's null. This path bypasses the BE -> MasterImpl
finish-task RPC entirely, so nothing else ever fills the field.
2. MasterImpl.finishPublishVersion (latent / defensive): structurally
setSuccTablets(possibly-null) and setFinished(true) run before the status-code
early return, so a null succTablets here would be reachable in principle. Under
the current BE protocol it is not: be/src/agent/task_worker_pool.cpp
unconditionally calls __set_succ_tablets(succ_tablets) before sending the
finish RPC, even when publish itself failed and the map is empty. That makes
request.isSetSuccTablets() always true and getSuccTablets() returns at worst an
empty (non-null) map, so this path does not produce the NPE today. It is called
out only as a latent risk that the call-site guard also covers should a future
BE-side change ever stop setting that field. The production NPE was reproduced
solely via path #1.
Why the bad state is sticky after BE recovery: the same Java object is held
by both AgentTaskQueue and TransactionState.publishVersionTasks (see
PublishVersionDaemon.addPublishVersionTask). AgentTaskCleanupDaemon removes the
AgentTaskQueue reference, but the
TransactionState.publishVersionTasks reference is only cleared in
TransactionState.pruneAfterVisible, which is invoked from PublishVersionDaemon
only when the transaction reaches VISIBLE. NPE prevents VISIBLE -> reference is
never pruned -> daemon polls the same poisoned object forever.
A previous defensive change at this call site only added a "task != null"
guard and did not cover succTablets == null, leaving the gap that the
AgentTaskCleanupDaemon path later started to widen in production.
Two complementary changes that together defend the invariant from both sides:
1. PublishVersionTask: initialise succTablets to Maps.newHashMap() and
coerce setSuccTablets(null) to an empty map. getSuccTablets() now never returns
null. This replaces "no entry yet" with "empty map", which is semantically
equivalent for every existing caller (containsKey, forEach, entrySet all behave
the same on an empty map).
2. DatabaseTransactionMgr.checkReplicaContinuousVersionSucc: explicit "succ
!= null" guard at the call site. With change #1 this is a belt-and-braces check
that also catches any future regression that re-introduces a null-writing path.
Behavioural impact: when a PublishVersionTask is force-finished without a
real BE response, the affected replica is now routed through the normal
errorReplicaIds / tabletWriteFailedReplicas branches instead of killing the
publish-thread worker with an NPE. The transaction either succeeds via
remaining quorum on healthy BEs, or fails cleanly via the standard
publish_version_timeout_second timeout.
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
### Release note
None
### Check List (For Author)
- Test <!-- At least one of them must be included. -->
- [ ] Regression test
- [ ] Unit Test
- [ ] Manual test (add detailed scripts or steps below)
- [ ] No need to test or manual test. Explain why:
- [ ] This is a refactor/code format and no logic has been changed.
- [ ] Previous test can cover this change.
- [ ] No code files have been changed.
- [ ] Other reason <!-- Add your reason? -->
- Behavior changed:
- [ ] No.
- [ ] Yes. <!-- Explain the behavior change -->
- Does this need documentation?
- [ ] No.
- [ ] Yes. <!-- Add document PR link here. eg:
https://github.com/apache/doris-website/pull/1214 -->
### Check List (For Reviewer who merge this PR)
- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR should
merge into -->
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]