Peter Bacsko created YUNIKORN-1179:
--------------------------------------

             Summary: Logs are spammed with health check status messages
                 Key: YUNIKORN-1179
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1179
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: core - scheduler
            Reporter: Peter Bacsko


YUNIKORN-1107 introduced a periodic background health check.

The problem is that it prints too much noise to the console:
{noformat}
2022-04-20T13:28:03.101Z        INFO    scheduler/health_checker.go:87  
Scheduler is healthy    {"health check values": [{"Name":"Scheduling 
errors","Succeeded":true,"Description":"Check for scheduling error entries in 
metrics","DiagnosisMessage":"There were 0 scheduling errors logged in the 
metrics"},{"Name":"Failed nodes","Succeeded":true,"Description":"Check for 
failed nodes entries in metrics","DiagnosisMessage":"There were 0 failed nodes 
logged in the metrics"},{"Name":"Negative 
resources","Succeeded":true,"Description":"Check for negative resources in the 
partitions","DiagnosisMessage":"Partitions with negative resources: 
[]"},{"Name":"Negative resources","Succeeded":true,"Description":"Check for 
negative resources in the nodes","DiagnosisMessage":"Nodes with negative 
resources: []"},{"Name":"Consistency of 
data","Succeeded":true,"Description":"Check if a node's allocated resource <= 
total resource of the node","DiagnosisMessage":"Nodes with inconsistent data: 
[]"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if 
total partition resource == sum of the node resources from the 
partition","DiagnosisMessage":"Partitions with inconsistent data: 
[]"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if 
node total resource = allocated resource + occupied resource + available 
resource","DiagnosisMessage":"Nodes with inconsistent data: 
[]"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if 
node capacity >= allocated resources on the node","DiagnosisMessage":"Nodes 
with inconsistent data: []"},{"Name":"Reservation 
check","Succeeded":true,"Description":"Check the reservation nr compared to the 
number of nodes","DiagnosisMessage":"Reservation/node nr ratio: 
[0.000000]"},{"Name":"Orphan allocation on node 
check","Succeeded":true,"Description":"Check if there are orphan allocations on 
the nodes","DiagnosisMessage":"Orphan allocations: []"},{"Name":"Orphan 
allocation on app check","Succeeded":true,"Description":"Check if there are 
orphan allocations on the applications","DiagnosisMessage":"OrphanAllocations: 
[]"}]}
2022-04-20T13:28:33.098Z        INFO    scheduler/health_checker.go:87  
Scheduler is healthy    {"health check values": [{"Name":"Scheduling 
errors","Succeeded":true,"Description":"Check for scheduling error entries in 
metrics","DiagnosisMessage":"There were 0 scheduling errors logged in the 
metrics"},{"Name":"Failed nodes","Succeeded":true,"Description":"Check for 
failed nodes entries in metrics","DiagnosisMessage":"There were 0 failed nodes 
logged in the metrics"},{"Name":"Negative 
resources","Succeeded":true,"Description":"Check for negative resources in the 
partitions","DiagnosisMessage":"Partitions with negative resources: 
[]"},{"Name":"Negative resources","Succeeded":true,"Description":"Check for 
negative resources in the nodes","DiagnosisMessage":"Nodes with negative 
resources: []"},{"Name":"Consistency of 
data","Succeeded":true,"Description":"Check if a node's allocated resource <= 
total resource of the node","DiagnosisMessage":"Nodes with inconsistent data: 
[]"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if 
total partition resource == sum of the node resources from the 
partition","DiagnosisMessage":"Partitions with inconsistent data: 
[]"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if 
node total resource = allocated resource + occupied resource + available 
resource","DiagnosisMessage":"Nodes with inconsistent data: 
[]"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if 
node capacity >= allocated resources on the node","DiagnosisMessage":"Nodes 
with inconsistent data: []"},{"Name":"Reservation 
check","Succeeded":true,"Description":"Check the reservation nr compared to the 
number of nodes","DiagnosisMessage":"Reservation/node nr ratio: 
[0.000000]"},{"Name":"Orphan allocation on node 
check","Succeeded":true,"Description":"Check if there are orphan allocations on 
the nodes","DiagnosisMessage":"Orphan allocations: []"},{"Name":"Orphan 
allocation on app check","Succeeded":true,"Description":"Check if there are 
orphan allocations on the applications","DiagnosisMessage":"OrphanAllocations: 
[]"}]}
{noformat}

I don't think we need that much output every 30 seconds. In fact, if the 
scheduler is healthy, we don't need to log anything at all, or at most a 
short message at DEBUG level.

If the health check fails, then we might log the result, but even in that 
case this much detail looks unnecessary.
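
One way the proposed behaviour could look, as a minimal sketch rather than the actual health_checker.go code: a healthy run yields a single short DEBUG line, and an unhealthy run reports only the checks that failed. The struct fields match the ones visible in the log output above; the type name HealthCheckInfo, the helper summarize, the WARN level for failures and the sample data are assumptions made for illustration. It uses only the Go standard library instead of the scheduler's own logger.

```go
package main

import (
	"fmt"
	"strings"
)

// HealthCheckInfo mirrors the fields seen in the logged JSON.
// The type name itself is an assumption for this sketch.
type HealthCheckInfo struct {
	Name             string
	Succeeded        bool
	Description      string
	DiagnosisMessage string
}

// summarize is a hypothetical helper: it picks a log level and a compact
// message for one health check run instead of dumping every check result.
func summarize(checks []HealthCheckInfo) (level string, msg string) {
	var failed []string
	for _, c := range checks {
		if !c.Succeeded {
			failed = append(failed, fmt.Sprintf("%s: %s", c.Name, c.DiagnosisMessage))
		}
	}
	if len(failed) == 0 {
		// Healthy: one short line, intended for DEBUG level only.
		return "DEBUG", "Scheduler is healthy"
	}
	// Unhealthy: report only the failed checks (WARN is an assumed level).
	return "WARN", "Scheduler is not healthy: " + strings.Join(failed, "; ")
}

func main() {
	// Made-up sample data, not taken from a real scheduler run.
	checks := []HealthCheckInfo{
		{Name: "Scheduling errors", Succeeded: true, DiagnosisMessage: "There were 0 scheduling errors logged in the metrics"},
		{Name: "Failed nodes", Succeeded: false, DiagnosisMessage: "There were 2 failed nodes logged in the metrics"},
	}
	level, msg := summarize(checks)
	fmt.Println(level, msg)
}
```

With this split, a deployment running at INFO level would see nothing from the health checker while the scheduler is healthy, and only the failing checks otherwise.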


