kfaraz opened a new pull request, #19091:
URL: https://github.com/apache/druid/pull/19091

   ### Description
   
   Follow-up to #19034, which addressed the issue of using the correct starting sequences after a scaling event.
   
   Discovered another race condition which causes task failures:
   ```
   First failure:
   Killing task[A] as its checkpoints[...] are not consistent with group 
checkpoints[...] or latest offsets in DB[...].
   
   Second failure:
   Stored metadata state[...] has already been updated by other tasks and has 
diverged from the start metadata state[...].
   ```
   
   This typically plays out as follows:
   - A scaling event occurs.
   - New task group B is created and is assigned a partition P1 which an old task group A (still pending completion) was also reading from.
   - Task group A is requested to checkpoint and start publishing.
   - Task group A returns E1 as its final end offset.
   - Task group B is assigned E1 as its starting offset.
   - The next invocation of `runInternal()` calls `verifyAndMergeCheckpoints()`.
   - Task group B is found to have offsets (E1) which are inconsistent with the DB (since A has not yet published E1 and the DB is still at E0).
   - Task group B is killed.
   - New task group C is launched with starting offset E0.
   - Task group A finishes publishing and updates the offsets in the metadata store from E0 to E1.
   - Task group C also fails while publishing, since the stored offsets have already advanced past what C knew (i.e. E0).
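The second failure can be sketched as a compare-and-swap on the stored offsets. This is a minimal illustration with hypothetical names, not the actual Druid metadata store API:

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: publishing does a compare-and-swap on the stored
// offsets, so it fails if the stored state no longer matches the start
// state the publishing task observed.
public class OffsetRaceSketch {
    // DB starts at E0 for partition P1
    static final AtomicReference<Map<String, Long>> storedOffsets =
        new AtomicReference<>(Map.of("P1", 0L));

    // Succeeds only if the stored state is still the state the task
    // started from (mirrors the "diverged from the start metadata
    // state" failure above).
    static boolean publish(Map<String, Long> start, Map<String, Long> end) {
        return storedOffsets.compareAndSet(start, end);
    }

    public static void main(String[] args) {
        Map<String, Long> e0 = storedOffsets.get();
        Map<String, Long> e1 = Map.of("P1", 1L);

        // Task group C launches believing the offsets are still at E0.
        Map<String, Long> cStart = e0;

        // Task group A finishes publishing, advancing E0 -> E1.
        System.out.println("A publishes: " + publish(e0, e1));   // prints "A publishes: true"

        // C then tries to publish from E0 and fails the CAS check,
        // because the stored state has already moved on to E1.
        System.out.println("C publishes: " + publish(cStart, Map.of("P1", 2L))); // prints "C publishes: false"
    }
}
```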
   
   There is already a check to avoid killing task group B while there are entries in `pendingCompletionTaskGroups`, but it only considers entries for the same group B. When scaling happens, partitions are reassigned, and tasks from a different group A may also end up updating the offsets of partitions that B is reading from.
   
   ### Fix
   
   While performing `verifyAndMergeCheckpoints`, consider a task eligible for checkpoint verification only if no other task group (irrespective of groupId) is waiting to publish to any of the partitions from which the task is reading.
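The check described above can be sketched as follows, using hypothetical, simplified types rather than the actual `SeekableStreamSupervisor` code:

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical, simplified model: a task group is reduced to its id and
// the set of partitions it reads from.
public class CheckpointEligibility {
    record TaskGroup(int groupId, Set<String> partitions) {}

    // A group is eligible for checkpoint verification only if no
    // pending-completion group (regardless of its groupId) overlaps any
    // of the partitions the group is reading from.
    static boolean isEligibleForVerification(
        TaskGroup group,
        Iterable<TaskGroup> pendingCompletionGroups
    ) {
        for (TaskGroup pending : pendingCompletionGroups) {
            if (!Collections.disjoint(pending.partitions(), group.partitions())) {
                return false;
            }
        }
        return true;
    }
}
```

In the scenario above, group B (reading P1) would simply be skipped during verification while the old group A is still pending completion on P1, instead of being killed over a transient offset mismatch.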
   
   This still feels like a temporary fix.
   
   A cleaner, longer-term fix would be to rethink how `TaskGroup` is handled inside `SeekableStreamSupervisor`, so that it lends itself better to scaling and partition reassignments.
   
   <hr>
   
   This PR has:
   
   - [ ] been self-reviewed.
      - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] a release note entry in the PR description.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.

