[
https://issues.apache.org/jira/browse/YUNIKORN-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538553#comment-17538553
]
Saad Ur Rahman edited comment on YUNIKORN-1213 at 5/27/22 5:54 PM:
-------------------------------------------------------------------
I have begun work on this by adding an _*enabled*_ field to the
_*HealthChecker*_ struct.
For backward compatibility, we will likely have to retain the default check
interval of _30 sec_ if no entries are found in the {_}*ConfigMap*{_}.
The [parameterized
constructor|https://github.com/apache/yunikorn-core/blob/3ba91fb8a41c0fd0dd6243326e583dea5167199f/pkg/scheduler/health_checker.go#L48]
does not need to access the _*ClusterContext*_ (I believe this is where the
_*ConfigMap*_ is parsed) and always sets _*enabled=true*_.
The [unparameterized
constructor|https://github.com/apache/yunikorn-core/blob/3ba91fb8a41c0fd0dd6243326e583dea5167199f/pkg/scheduler/health_checker.go#L41]
will need to get the data out of
[_*ClusterContext.partitions*_|https://github.com/apache/yunikorn-core/blob/3ba91fb8a41c0fd0dd6243326e583dea5167199f/pkg/scheduler/partition.go#L45].
I think this means I will need to parse the _*enabled*_ and _*interval*_
values out of the _*ConfigMap*_ and into fields in the *PartitionContext* struct
({*}healthCheckEnabled{*} and {*}healthCheckInterval{*}).
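A minimal sketch of the two constructors described above, under stated assumptions: the names checkConfig, checker, newCheckerWithPeriod and newCheckerFromConfig are hypothetical stand-ins and not the yunikorn-core API, and defaulting to enabled when the ConfigMap has no entry is an assumption made here for backward compatibility.

```go
package main

import "time"

// defaultPeriod mirrors the fixed 30s interval mentioned in the issue.
const defaultPeriod = 30 * time.Second

// checkConfig is a hypothetical holder for values parsed from the ConfigMap.
// A nil enabled pointer or a non-positive period means "no entry found".
type checkConfig struct {
	enabled *bool
	period  time.Duration
}

// checker is a simplified stand-in for the HealthChecker struct.
type checker struct {
	enabled bool
	period  time.Duration
}

// newCheckerWithPeriod models the parameterized constructor: it never reads
// the configuration and is always enabled.
func newCheckerWithPeriod(period time.Duration) *checker {
	return &checker{enabled: true, period: period}
}

// newCheckerFromConfig models the unparameterized constructor: it falls back
// to enabled=true (an assumption) and the 30s default when entries are missing.
func newCheckerFromConfig(cfg checkConfig) *checker {
	c := &checker{enabled: true, period: defaultPeriod}
	if cfg.enabled != nil {
		c.enabled = *cfg.enabled
	}
	if cfg.period > 0 {
		c.period = cfg.period
	}
	return c
}
```

Using a pointer for the enabled flag lets the sketch tell "not configured" apart from "explicitly disabled", which a plain bool cannot do.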
The subsequent issue is how we deal with a disabled check. The relevant code is
the goroutine that runs the health checks at an interval. We can either return
immediately if _*enabled=false*_ or let the first health check on *L58* run.
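The first of these two options (return immediately when disabled) could look roughly like the sketch below. All names here (intervalChecker, newIntervalChecker, start, stopChecking) are hypothetical, and the loop shape, with one check before the ticker takes over, is assumed from the description rather than copied from health_checker.go.

```go
package main

import "time"

// defaultPeriod mirrors the fixed 30s interval mentioned in the issue.
const defaultPeriod = 30 * time.Second

// intervalChecker is a simplified, hypothetical stand-in for HealthChecker.
type intervalChecker struct {
	enabled bool
	period  time.Duration // must be positive, time.NewTicker panics otherwise
	check   func()
	stop    chan struct{}
}

func newIntervalChecker(enabled bool, period time.Duration, check func()) *intervalChecker {
	return &intervalChecker{enabled: enabled, period: period, check: check, stop: make(chan struct{})}
}

// start launches the background loop and reports whether it was started.
// A disabled checker returns immediately without spawning a goroutine.
func (c *intervalChecker) start() bool {
	if !c.enabled {
		return false
	}
	go func() {
		ticker := time.NewTicker(c.period)
		defer ticker.Stop()
		c.check() // first check runs right away, before the first tick
		for {
			select {
			case <-c.stop:
				return
			case <-ticker.C:
				c.check()
			}
		}
	}()
	return true
}

// stopChecking ends the background loop; it must be called at most once.
func (c *intervalChecker) stopChecking() { close(c.stop) }
```

Returning before the goroutine is created means a disabled checker costs nothing at runtime; the alternative of letting the first check run would move the enabled test after the initial c.check() call.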
For the check routines, we are likely going to need to populate the
[{*}dao.HealthCheckInfo{*}|https://github.com/apache/yunikorn-core/blob/3ba91fb8a41c0fd0dd6243326e583dea5167199f/pkg/scheduler/health_checker.go#L119].
I think we should set the _*Description*_ to a string "Health Check Disabled"
with the _*DiagnosisMessage*_ of "Health checks are disabled in the ConfigMap
for this partition."
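As an illustration of the disabled result proposed above, here is a small sketch. The healthCheckInfo type is a local stand-in that models only the two fields named in this comment, not the full dao.HealthCheckInfo struct, and runCheck is a hypothetical helper.

```go
package main

// healthCheckInfo is a local stand-in for dao.HealthCheckInfo; only the two
// fields named in the comment are modeled here.
type healthCheckInfo struct {
	Description      string
	DiagnosisMessage string
}

// Strings proposed in the comment for a disabled health check.
const (
	disabledDescription = "Health Check Disabled"
	disabledDiagnosis   = "Health checks are disabled in the ConfigMap for this partition."
)

// runCheck is a hypothetical wrapper: it skips the real check and reports the
// disabled placeholder when health checks are turned off.
func runCheck(enabled bool, check func() healthCheckInfo) healthCheckInfo {
	if !enabled {
		return healthCheckInfo{
			Description:      disabledDescription,
			DiagnosisMessage: disabledDiagnosis,
		}
	}
	return check()
}
```

Wrapping each check this way keeps the disabled handling in one place, so the individual check routines would not each need their own enabled test.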
What are your thoughts? Sorry, I know I am running a lot of changes by you, but I
will be more autonomous once I am more acquainted with the codebase.
> The interval of the background health checker needs to be configurable
> ----------------------------------------------------------------------
>
> Key: YUNIKORN-1213
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1213
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - scheduler
> Reporter: Weiwei Yang
> Assignee: Saad Ur Rahman
> Priority: Major
> Labels: pull-request-available
>
> YUNIKORN-1107 adds a background health checker that verifies the
> scheduler data correctness at a fixed 30s interval:
> https://github.com/apache/yunikorn-core/blob/3ba91fb8a41c0fd0dd6243326e583dea5167199f/pkg/scheduler/health_checker.go#L34.
> We need to make this configurable, so that the user can either set a
> longer/shorter interval or disable it completely.