maytasm opened a new pull request #9892:
URL: https://github.com/apache/druid/pull/9892
Fail creation of TaskResource if availabilityGroup is null
availabilityGroup in TaskResource should never be null. If a task has
availabilityGroup = null, it does not fail and still runs. However, while this
task is running, all other tasks fail to start (and are marked FAILED).
Once the task with availabilityGroup = null finishes running, other tasks can
be created again.
The error seen (when another, good task fails) is:
```
2020-05-04T04:19:59,129 ERROR [tasks-runner-0] org.apache.druid.indexing.overlord.RemoteTaskRunner - Exception while trying to assign task: {class=org.apache.druid.indexing.overlord.RemoteTaskRunner, exceptionType=class java.lang.NullPointerException, exceptionMessage=at index 0, taskId=index_kafka_xxx}
java.lang.NullPointerException: at index 0
	at com.google.common.collect.ObjectArrays.checkElementNotNull(ObjectArrays.java:240) ~[guava-16.0.1.jar:?]
	at com.google.common.collect.ImmutableSet.construct(ImmutableSet.java:195) ~[guava-16.0.1.jar:?]
	at com.google.common.collect.ImmutableSet.copyOf(ImmutableSet.java:375) ~[guava-16.0.1.jar:?]
	at org.apache.druid.indexing.overlord.ImmutableWorkerInfo.<init>(ImmutableWorkerInfo.java:59)
```
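The trace points at an immutable-set copy inside `ImmutableWorkerInfo` rejecting a null element. A minimal sketch of that failure mode follows, assuming (this is not confirmed by the trace alone) that the set being copied holds the availability groups of the tasks currently running. It uses the JDK's `Set.copyOf` as a stand-in for Guava's `ImmutableSet.copyOf`, since both throw `NullPointerException` on a null element. The class and method names are hypothetical and are not Druid code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class NullGroupDemo {
    // Hypothetical stand-in for the snapshot built while assigning a task.
    static Set<String> snapshotGroups(List<String> runningTaskGroups) {
        // Like Guava's ImmutableSet.copyOf, Set.copyOf rejects null elements.
        return Set.copyOf(runningTaskGroups);
    }

    // Returns false when the snapshot cannot be built, i.e. when assignment would fail.
    static boolean canAssign(List<String> runningTaskGroups) {
        try {
            snapshotGroups(runningTaskGroups);
            return true;
        } catch (NullPointerException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        List<String> groups = new ArrayList<>();
        groups.add("index_kafka_a");
        System.out.println("only good tasks running: " + canAssign(groups));

        groups.add(null); // a task with availabilityGroup = null starts running
        System.out.println("null-group task running: " + canAssign(groups));

        groups.removeIf(g -> g == null); // that task finishes
        System.out.println("null-group task finished: " + canAssign(groups));
    }
}
```

Under this assumption, the single null entry blocks every new assignment until the offending task exits, which matches the behavior described above.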
While availabilityGroup should never be null, we have seen in our
environment that a task can appear with availabilityGroup = null. We have yet to
determine how this happened, although one possible scenario is that the
ingestionSpec posted to Druid explicitly contains availabilityGroup set to null
(for example, "resource":{"availabilityGroup":null,"requiredCapacity":1}).
Regardless of how it happened, we should fail any task that has
availabilityGroup = null by enforcing that availabilityGroup is not null on
TaskResource. We should do this for the following reasons:
1) A bad task with availabilityGroup = null should not be able to affect other
good tasks running on the cluster (especially as those other tasks can be for
a different datasource, etc.). Hence, the bad task with availabilityGroup = null
should fail fast instead of running normally.
2) We should fail fast so that we know exactly how and where the TaskResource
was created with availabilityGroup = null (to help with debugging).
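The fail-fast check argued for above can be sketched as a constructor-time null check. This is only an illustration under stated assumptions, not the actual diff in this PR: `TaskResourceSketch` is a hypothetical, simplified stand-in for `TaskResource` that omits JSON deserialization and uses the JDK's `Objects.requireNonNull` where the real class may use a different precondition helper.

```java
import java.util.Objects;

public class TaskResourceSketch {
    private final String availabilityGroup;
    private final int requiredCapacity;

    public TaskResourceSketch(String availabilityGroup, int requiredCapacity) {
        // Fail at construction time so the stack trace points at whoever
        // supplied the null, instead of failing later in unrelated tasks.
        this.availabilityGroup =
            Objects.requireNonNull(availabilityGroup, "availabilityGroup");
        this.requiredCapacity = requiredCapacity;
    }

    public String getAvailabilityGroup() {
        return availabilityGroup;
    }

    public int getRequiredCapacity() {
        return requiredCapacity;
    }

    public static void main(String[] args) {
        TaskResourceSketch ok = new TaskResourceSketch("index_kafka_xxx", 1);
        System.out.println(ok.getAvailabilityGroup());

        try {
            new TaskResourceSketch(null, 1);
        } catch (NullPointerException e) {
            // The message names the offending field.
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With a check like this, only the task carrying the null group is rejected, and the failure surfaces where the resource is created.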
This PR has:
- [x] been self-reviewed.
- [ ] added documentation for new or modified features or behaviors.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked
related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in
[licenses.yaml](https://github.com/apache/druid/blob/master/licenses.yaml)
- [ ] added comments explaining the "why" and the intent of the code
wherever would not be obvious for an unfamiliar reader.
- [x] added unit tests or modified existing tests to cover new code paths.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.