[ 
https://issues.apache.org/jira/browse/YUNIKORN-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538553#comment-17538553
 ] 

Saad Ur Rahman edited comment on YUNIKORN-1213 at 5/27/22 5:54 PM:
-------------------------------------------------------------------

I have begun work on this by adding an _*enabled*_ field to the 
_*HealthChecker*_ struct.

For backward compatibility, we will likely have to retain the default check 
interval of _30 sec_ if no entries are found in the _*ConfigMap*_.

The [parameterized 
constructor|https://github.com/apache/yunikorn-core/blob/3ba91fb8a41c0fd0dd6243326e583dea5167199f/pkg/scheduler/health_checker.go#L48]
 need not access the _*ClusterContext*_ (I believe this is where the 
_*ConfigMap*_ is parsed) and always sets _*enabled=true*_.

The [unparameterized 
constructor|https://github.com/apache/yunikorn-core/blob/3ba91fb8a41c0fd0dd6243326e583dea5167199f/pkg/scheduler/health_checker.go#L41]
 will need to get the data out of 
[_*ClusterContext.partitions*_|https://github.com/apache/yunikorn-core/blob/3ba91fb8a41c0fd0dd6243326e583dea5167199f/pkg/scheduler/partition.go#L45].
 I think this means I will need to parse _*enabled*_ and _*interval*_ out 
of the _*ConfigMap*_ and into fields on the *PartitionContext* struct 
(*healthCheckEnabled* and *healthCheckInterval*).

The subsequent issue is how we deal with a disabled check. The health checks 
run at an interval in a goroutine. We can either return immediately 
if _*enabled=false*_ or let the first health check on *L58* run.

For the check routines, we are likely going to need to populate the 
[{*}dao.HealthCheckInfo{*}|https://github.com/apache/yunikorn-core/blob/3ba91fb8a41c0fd0dd6243326e583dea5167199f/pkg/scheduler/health_checker.go#L119].
 I think we should set the _*Description*_ to a string "Health Check Disabled" 
with the _*DiagnosisMessage*_ of "Health checks are disabled in the ConfigMap 
for this partition."
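
Sketched in code, the disabled result could look like the following. The struct is a local stand-in that carries only the two fields named above, not the full dao.HealthCheckInfo definition, and the helper name is hypothetical.

```go
package main

// healthCheckInfo is a local stand-in for dao.HealthCheckInfo, limited to
// the two fields discussed here.
type healthCheckInfo struct {
	Description      string
	DiagnosisMessage string
}

// disabledHealthCheckInfo is a hypothetical helper that builds the entry
// reported when health checks are switched off for a partition.
func disabledHealthCheckInfo() healthCheckInfo {
	return healthCheckInfo{
		Description:      "Health Check Disabled",
		DiagnosisMessage: "Health checks are disabled in the ConfigMap for this partition.",
	}
}
```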

What are your thoughts? Sorry, I know I am clearing a lot of changes with you, but I 
will be more autonomous once I am better acquainted with the codebase.


> The interval of the background health checker needs to be configurable
> ----------------------------------------------------------------------
>
>                 Key: YUNIKORN-1213
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1213
>             Project: Apache YuniKorn
>          Issue Type: Improvement
>          Components: core - scheduler
>            Reporter: Weiwei Yang
>            Assignee: Saad Ur Rahman
>            Priority: Major
>              Labels: pull-request-available
>
> YUNIKORN-1107 adds a background health checker that verifies the 
> scheduler data correctness at a fixed 30s interval: 
> https://github.com/apache/yunikorn-core/blob/3ba91fb8a41c0fd0dd6243326e583dea5167199f/pkg/scheduler/health_checker.go#L34.
>  We need to make this configurable, so that the user can either set a 
> longer/shorter interval or disable it completely.


