maytasm opened a new pull request #9892:
URL: https://github.com/apache/druid/pull/9892


   Fail creation of TaskResource if availabilityGroup is null
   
   availabilityGroup in TaskResource should never be null. If a task has 
availabilityGroup = null, it does not fail and still runs. However, while this 
task is running, all other tasks fail to start (and are marked FAILED). 
Once the task with availabilityGroup = null finishes running, other tasks can 
be created again. 
   The error seen (when the other, good tasks fail) is:
   ```
   2020-05-04T04:19:59,129 ERROR [tasks-runner-0] org.apache.druid.indexing.overlord.RemoteTaskRunner - Exception while trying to assign task: {class=org.apache.druid.indexing.overlord.RemoteTaskRunner, exceptionType=class java.lang.NullPointerException, exceptionMessage=at index 0, taskId=index_kafka_xxx}
   java.lang.NullPointerException: at index 0
        at com.google.common.collect.ObjectArrays.checkElementNotNull(ObjectArrays.java:240) ~[guava-16.0.1.jar:?]
        at com.google.common.collect.ImmutableSet.construct(ImmutableSet.java:195) ~[guava-16.0.1.jar:?]
        at com.google.common.collect.ImmutableSet.copyOf(ImmutableSet.java:375) ~[guava-16.0.1.jar:?]
        at org.apache.druid.indexing.overlord.ImmutableWorkerInfo.<init>(ImmutableWorkerInfo.java:59)
   ```
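
The trace above ends in Guava's `ImmutableSet.copyOf`, which rejects null elements. The following is a minimal sketch of that failure mode, assuming (the trace does not confirm this) that the copied collection holds the availability groups of the tasks running on a worker. It uses the JDK's `Set.copyOf` (Java 10+) as a stand-in for Guava, and the class and method names are hypothetical, not Druid code.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the snapshot step seen in the stack trace.
final class AvailabilityGroupCopySketch {

    // Takes an immutable snapshot of the groups in use on a worker.
    // Like Guava's ImmutableSet.copyOf, Set.copyOf throws
    // NullPointerException if any element is null.
    static Set<String> snapshot(Set<String> groupsInUse) {
        return Set.copyOf(groupsInUse);
    }

    public static void main(String[] args) {
        Set<String> groupsInUse = new HashSet<>();
        groupsInUse.add("index_kafka_good");
        System.out.println("snapshot size: " + snapshot(groupsInUse).size());

        // One running task with a null availabilityGroup poisons the set,
        // so every later snapshot fails, regardless of which task is
        // being assigned.
        groupsInUse.add(null);
        try {
            snapshot(groupsInUse);
        } catch (NullPointerException e) {
            System.out.println("snapshot failed: null element in set");
        }
    }
}
```

Under that assumption, the failure hits whichever task is being assigned next, not the task that introduced the null, which matches the behavior described above.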
   
   While availabilityGroup should never be null, we have seen in our 
environment that a task can appear with availabilityGroup = null. We have yet to 
determine how this happened. One possible scenario is that the 
ingestionSpec posted to Druid explicitly sets availabilityGroup to null 
(for example, "resource":{"availabilityGroup":null,"requiredCapacity":1}).
   
   Regardless of how it happened, we should fail any task that has 
availabilityGroup = null by enforcing that availabilityGroup is not null in 
TaskResource. We should do this for the following reasons:
   1) A bad task with availabilityGroup = null should not be able to affect other 
good tasks running on the cluster (especially as those other tasks can be for a 
different datasource, etc.). Hence, the bad task with availabilityGroup = null 
should fail fast instead of running normally.
   2) We should fail fast so that we know exactly how and where the TaskResource 
got created with availabilityGroup = null (to help with debugging).
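
As an illustration of the fail-fast check described above, here is a minimal sketch of a resource class that rejects a null availabilityGroup at construction time. It is an assumption about the shape of the change, not the actual diff: the class name `TaskResourceSketch` is hypothetical, JSON annotations are omitted, and the JDK's `Objects.requireNonNull` is used so the example runs without extra dependencies.

```java
import java.util.Objects;

// Hypothetical, simplified stand-in for TaskResource (not the Druid class).
final class TaskResourceSketch {
    private final String availabilityGroup;
    private final int requiredCapacity;

    TaskResourceSketch(String availabilityGroup, int requiredCapacity) {
        // Fail fast: the exception is thrown where the bad resource is
        // created, so the stack trace points at the source of the null.
        this.availabilityGroup =
            Objects.requireNonNull(availabilityGroup, "availabilityGroup");
        this.requiredCapacity = requiredCapacity;
    }

    String getAvailabilityGroup() {
        return availabilityGroup;
    }

    int getRequiredCapacity() {
        return requiredCapacity;
    }

    public static void main(String[] args) {
        TaskResourceSketch ok = new TaskResourceSketch("index_kafka_xxx", 1);
        System.out.println("created: " + ok.getAvailabilityGroup());

        try {
            // Mirrors "resource":{"availabilityGroup":null,"requiredCapacity":1}
            new TaskResourceSketch(null, 1);
        } catch (NullPointerException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With a check like this, only the task carrying the null resource fails, and it fails at creation rather than after it has started running.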
   
   
   This PR has:
   - [x] been self-reviewed.
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/licenses.yaml)
   - [ ] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   

