github-actions[bot] commented on code in PR #64423:
URL: https://github.com/apache/doris/pull/64423#discussion_r3401251027
##########
fe/fe-common/src/main/java/org/apache/doris/job/cdc/request/WriteRecordRequest.java:
##########
@@ -31,4 +31,6 @@ public class WriteRecordRequest extends JobBaseRecordRequest {
private String token;
private String taskId;
private Map<String, String> streamLoadProps;
+ // previous task ended abnormally, rebuild reader instead of reusing
+ private boolean rebuildReader;
Review Comment:
Using a primitive boolean makes old FE requests deserialize as
`rebuildReader=false`. During a rolling upgrade, a new cdc_client can receive
`/api/writeRecords` from an old FE that still selects BEs round-robin and never
sends this field. Because this PR also stops finishing binlog readers in
`cleanupReaderResources()`, those old payloads can leave live readers running
on multiple BEs, and later reuse them, for the same job/PG slot/MySQL stream.
Please make
absence mean legacy/no-reuse behavior, for example a nullable `Boolean` where
`null` forces rebuild, or add an explicit reuse-enabled flag that only the new
FE sends after it has bound the job to one BE.
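
A minimal sketch of the nullable `Boolean` option, assuming the JSON mapper
leaves an absent boxed field as `null`. The class name `LegacyAwareRequest`
and the helper `shouldRebuildReader()` are hypothetical and not part of this
PR; only the field name `rebuildReader` comes from the diff.

```java
// Hypothetical stand-in for WriteRecordRequest; illustrates the idea only.
class LegacyAwareRequest {
    // Boxed type: null means the field was absent from the payload (old FE).
    private Boolean rebuildReader;

    public Boolean getRebuildReader() {
        return rebuildReader;
    }

    public void setRebuildReader(Boolean rebuildReader) {
        this.rebuildReader = rebuildReader;
    }

    // Absent (null) or true means rebuild; only an explicit false permits reuse.
    public boolean shouldRebuildReader() {
        return rebuildReader == null || rebuildReader;
    }
}
```

With this shape, reuse is opt-in: a payload from an old FE never carries the
field, so it always takes the rebuild path.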
##########
fe/fe-core/src/main/java/org/apache/doris/job/extensions/insert/streaming/StreamingInsertJob.java:
##########
@@ -770,6 +793,20 @@ public void clearRunningStreamTask(JobStatus newJobStatus) {
}
}
+ // Command entry for a manual status change: reset the failure/retry budget, and on manual pause
+ // release the reader (keep slot). "Manual" is decided by the caller, never by reading failureReason.
+ public void onManualStatusAltered(JobStatus newStatus, FailureReason reason) {
+ lock.writeLock().lock();
Review Comment:
Manual PAUSE can be followed immediately by RESUME while this
fire-and-forget release RPC is still in flight or has failed. After a
successful round `needRebuildReader` is false, so the resumed task sends
`rebuildReader=false`; if `/api/releaseReader` has not detached the old owner
yet, `getReaderAndClaim()` only changes `ownerTaskId` and returns the same live
`SourceReader`. The canceled task can still be polling that reader, so the new
task can concurrently reuse the same stream reader and mix or duplicate records
before the old task observes it lost ownership. Please mark the job as needing
a rebuild as part of manual pause, or otherwise wait for/deterministically
complete detach, so the first resumed task swaps out the old reader even when
release races or fails.
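
A rough sketch of the first option (mark the job as needing a rebuild during
manual pause), assuming the flag is set under the same write lock and is only
cleared after a successful round. The class `PauseRebuildSketch` and its
method names are hypothetical; only `needRebuildReader` and the lock usage
mirror names that appear in this PR.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical model of the job-side flag; not the actual StreamingInsertJob.
class PauseRebuildSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private boolean needRebuildReader;

    // Manual pause: set the flag first, independent of the release RPC outcome.
    public void onManualPause() {
        lock.writeLock().lock();
        try {
            needRebuildReader = true;
            // The fire-and-forget release RPC would be issued here; its result
            // no longer decides whether the next task rebuilds the reader.
        } finally {
            lock.writeLock().unlock();
        }
    }

    // A successful round is the only point where reuse becomes allowed again.
    public void onTaskSucceeded() {
        lock.writeLock().lock();
        try {
            needRebuildReader = false;
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Value the next task would send as rebuildReader.
    public boolean nextRebuildFlag() {
        lock.readLock().lock();
        try {
            return needRebuildReader;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Under this sketch, an immediate RESUME after PAUSE sends `rebuildReader=true`
even if the release request is still in flight or has failed.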
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]