YongGang commented on PR #16442:
URL: https://github.com/apache/druid/pull/16442#issuecomment-2111716253

   Thanks for your early review @kfaraz !
   
   > - If the supervisor is not going to launch tasks, would it make sense to 
just move the supervisor to an idle state rather than introduce a new state?
   > 
   > - Rather than introduce a new config, why not reuse the existing 
idleConfig 
(https://druid.apache.org/docs/latest/ingestion/kafka-ingestion/#idle-configuration).
 That was also introduced keeping in mind cost concerns and unnecessary task 
launches. We could just add a new field to idleConfig which dictates the 
supervisor to become idle if maxAllowedParseExceptions is violated.
   
   The primary distinction between the existing idle state and the newly 
introduced `UNHEALTHY_TASKS_STOP_CREATING_NEW` state is that the former 
represents a healthy condition, indicating that there are no further records to 
process. In contrast, `UNHEALTHY_TASKS_STOP_CREATING_NEW` is an unhealthy 
state. It can be seen as an extension of the existing `UNHEALTHY_TASKS` state, 
signifying not only that all tasks have failed or are unhealthy but also that 
no new tasks will be created due to the errors. This state requires 
intervention by a user or operator to restore it to a healthy condition.
   
   > - I am not sure if the "no progress" check is really going to be effective 
or if it is even desirable. There could be other reasons behind not making 
progress and not all of them would qualify for making the supervisor idle.
   
   The Supervisor will transition to the new 
`UNHEALTHY_TASKS_STOP_CREATING_NEW` state only if three conditions are met: "no 
progress" AND all failed task are due to parsing exceptions AND the count of 
failed task exceed `TaskUnhealthinessThreshold`. Under these circumstances, we 
can attribute the parsing exceptions as the root cause of the Supervisor's lack 
of progress.
   
   > - What if maxAllowedParseExceptions keeps getting violated for some 
partitions and not for others. The task launched for the former partitions 
would always fail but should we become idle in this case or just not launch 
tasks to read from the bad partitions?
   
   In this case that means Supervisor is making progress, we won't stop 
Supervisor from launching new tasks. Because transitioning to the 
`UNHEALTHY_TASKS_STOP_CREATING_NEW` state results in a halt to new task 
initiation, we approach this transition with great caution to ensure it is 
absolutely necessary.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to