[PR] Handle status failures for streaming supervisors (druid)

via GitHub Mon, 16 Oct 2023 14:17:22 -0700


georgew5656 opened a new pull request, #15174:
URL: https://github.com/apache/druid/pull/15174

Currently it's possible for there to be unnecessary task failures during
streaming ingestion after tasks are launched and after tasks finish handoff
(right before exit).

The errors occur when a peon is running (e.g. the jetty server is
responding), but there is no chat handler registered for the task.
SeekableStreamIndexTaskRunner is responsible for registering and deregistering
the chat handler, but this is technically not synced up exactly with when the
jetty server starts.

If the chat handler is not registered, the ChatHandlerResource throws a 400
(non-retryable) error. IMO this error should be a ServiceUnavailable error
because the resource is per task, so a missing handler indicates that the task
is setting up or tearing down and can be retried.

If this ChatHandlerResource throws a 400 in response to a supervisor's
getStatusAsync call, the supervisor will attempt to kill the task even though
it may be in the process of starting up or shutting down successfully. The task
will fail with "Task [%s] failed to return status, killing task", even though
it may have been able to run successfully.

Updating the errror code from 400 -> 503 directly fixes the issue during
startup, since the supervisor will retry the 503 error according to its
retryPolicy and will eventually succeed when the task comes up.

For the case where a task is shutting down, the client will continue
retrying the 503 exceptions (either from the ChatHandlerResource or from when
Jetty itself has been torn down)
(https://github.com/apache/druid/blob/0a27a7a7ca1561630730c9288e5b36ead1268aed/server/src/main/java/org/apache/druid/rpc/ServiceClientImpl.java#L115).
Eventually the task will be reported by the task runner as exited and this
will be reflected in taskStorage. The client getStatusAsync call will stop
retrying at this point and the future will fail
(https://github.com/apache/druid/blob/0a27a7a7ca1561630730c9288e5b36ead1268aed/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/SeekableStreamIndexTaskClientAsyncImpl.java#L661).

To handle this case, I added code in the supervisor logic to check
taskStorage before killing a unresponsive task. If the task has been marked as
complete in taskStorage, we can conclude that we were trying to contact a task
that was in the process of cleaning up and we don't need to do anything else.

#### Release note
Cleanup race condition that can occur at high streaming concurrency.

<hr>

##### Key changed/added classes in this PR
* `SeekableStreamSupervisor`
* `ChatHandlerResource`

This PR has:

- [X] been self-reviewed.
- [ ] using the [concurrency
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
(Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [ ] a release note entry in the PR description.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked
related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
- [ ] added comments explaining the "why" and the intent of the code
wherever would not be obvious for an unfamiliar reader.
- [ ] added unit tests or modified existing tests to cover new code paths,
ensuring the threshold for [code
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
is met.
- [ ] added integration tests.
- [X] been tested in a test Druid cluster.

I tested this quite a bit in a test environment with quite a few supervisors
(300). I am not sure how to go about writing a integration test to test this
logic though, open to any suggestions.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] Handle status failures for streaming supervisors (druid)

Reply via email to