georgew5656 opened a new pull request, #15174:
URL: https://github.com/apache/druid/pull/15174

   Currently it's possible for there to be unnecessary task failures during 
streaming ingestion after tasks are launched and after tasks finish handoff 
(right before exit).
   
   The errors occur when a peon is running (e.g. the jetty server is 
responding), but there is no chat handler registered for the task. 
SeekableStreamIndexTaskRunner is responsible for registering and deregistering 
the chat handler, but this is technically not synced up exactly with when the 
jetty server starts.
   
   If the chat handler is not registered, the ChatHandlerResource throws a 400 
(non-retryable) error. IMO this error should be a ServiceUnavailable error 
because the resource is per task,  so a missing handler indicates that the task 
is setting up or tearing down and can be retried.
   
   If this ChatHandlerResource throws a 400 in response to a supervisor's 
getStatusAsync call, the supervisor will attempt to kill the task even though 
it may be in the process of starting up or shutting down successfully. The task 
will fail with "Task [%s] failed to return status, killing task", even though 
it may have been able to run successfully.
   
   Updating the errror code from 400 -> 503 directly fixes the issue during 
startup, since the supervisor will retry the 503 error according to its 
retryPolicy and will eventually succeed when the task comes up.
   
   For the case where a task is shutting down, the client will continue 
retrying the 503 exceptions (either from the ChatHandlerResource or from when 
Jetty itself has been torn down) 
(https://github.com/apache/druid/blob/0a27a7a7ca1561630730c9288e5b36ead1268aed/server/src/main/java/org/apache/druid/rpc/ServiceClientImpl.java#L115).
 Eventually the task will be reported by the task runner as exited and this 
will be reflected in taskStorage. The client getStatusAsync call will stop 
retrying at this point and the future will fail 
(https://github.com/apache/druid/blob/0a27a7a7ca1561630730c9288e5b36ead1268aed/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/SeekableStreamIndexTaskClientAsyncImpl.java#L661).
   
   To handle this case, I added code in the supervisor logic to check 
taskStorage before killing a unresponsive task. If the task has been marked as 
complete in taskStorage, we can conclude that we were trying to contact a task 
that was in the process of cleaning up and we don't need to do anything else.
   
   #### Release note
   Cleanup race condition that can occur at high streaming concurrency.
   
   
   <hr>
   
   ##### Key changed/added classes in this PR
    * `SeekableStreamSupervisor`
    * `ChatHandlerResource`
   
   
   This PR has:
   
   - [X] been self-reviewed.
      - [ ] using the [concurrency 
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
 (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] a release note entry in the PR description.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [ ] added integration tests.
   - [X] been tested in a test Druid cluster.
   
   
   I tested this quite a bit in a test environment with quite a few supervisors 
(300). I am not sure how to go about writing a integration test to test this 
logic though, open to any suggestions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to