YongGang commented on PR #16442: URL: https://github.com/apache/druid/pull/16442#issuecomment-2111716253
Thanks for your early review @kfaraz ! > - If the supervisor is not going to launch tasks, would it make sense to just move the supervisor to an idle state rather than introduce a new state? > > - Rather than introduce a new config, why not reuse the existing idleConfig (https://druid.apache.org/docs/latest/ingestion/kafka-ingestion/#idle-configuration). That was also introduced keeping in mind cost concerns and unnecessary task launches. We could just add a new field to idleConfig which dictates the supervisor to become idle if maxAllowedParseExceptions is violated. The primary distinction between the existing idle state and the newly introduced `UNHEALTHY_TASKS_STOP_CREATING_NEW` state is that the former represents a healthy condition, indicating that there are no further records to process. In contrast, `UNHEALTHY_TASKS_STOP_CREATING_NEW` is an unhealthy state. It can be seen as an extension of the existing `UNHEALTHY_TASKS` state, signifying not only that all tasks have failed or are unhealthy but also that no new tasks will be created due to the errors. This state requires intervention by a user or operator to restore it to a healthy condition. > - I am not sure if the "no progress" check is really going to be effective or if it is even desirable. There could be other reasons behind not making progress and not all of them would qualify for making the supervisor idle. The Supervisor will transition to the new `UNHEALTHY_TASKS_STOP_CREATING_NEW` state only if three conditions are met: "no progress" AND all failed task are due to parsing exceptions AND the count of failed task exceed `TaskUnhealthinessThreshold`. Under these circumstances, we can attribute the parsing exceptions as the root cause of the Supervisor's lack of progress. > - What if maxAllowedParseExceptions keeps getting violated for some partitions and not for others. The task launched for the former partitions would always fail but should we become idle in this case or just not launch tasks to read from the bad partitions? In this case that means Supervisor is making progress, we won't stop Supervisor from launching new tasks. Because transitioning to the `UNHEALTHY_TASKS_STOP_CREATING_NEW` state results in a halt to new task initiation, we approach this transition with great caution to ensure it is absolutely necessary. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
