Peter Bacsko created YUNIKORN-1179:
--------------------------------------
Summary: Logs are spammed with health check status messages
Key: YUNIKORN-1179
URL: https://issues.apache.org/jira/browse/YUNIKORN-1179
Project: Apache YuniKorn
Issue Type: Bug
Components: core - scheduler
Reporter: Peter Bacsko
YUNIKORN-1107 introduced a periodic background health check.
The problem is that it prints too much noise to the console:
{noformat}
2022-04-20T13:28:03.101Z INFO scheduler/health_checker.go:87
Scheduler is healthy {"health check values": [{"Name":"Scheduling
errors","Succeeded":true,"Description":"Check for scheduling error entries in
metrics","DiagnosisMessage":"There were 0 scheduling errors logged in the
metrics"},{"Name":"Failed nodes","Succeeded":true,"Description":"Check for
failed nodes entries in metrics","DiagnosisMessage":"There were 0 failed nodes
logged in the metrics"},{"Name":"Negative
resources","Succeeded":true,"Description":"Check for negative resources in the
partitions","DiagnosisMessage":"Partitions with negative resources:
[]"},{"Name":"Negative resources","Succeeded":true,"Description":"Check for
negative resources in the nodes","DiagnosisMessage":"Nodes with negative
resources: []"},{"Name":"Consistency of
data","Succeeded":true,"Description":"Check if a node's allocated resource <=
total resource of the node","DiagnosisMessage":"Nodes with inconsistent data:
[]"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if
total partition resource == sum of the node resources from the
partition","DiagnosisMessage":"Partitions with inconsistent data:
[]"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if
node total resource = allocated resource + occupied resource + available
resource","DiagnosisMessage":"Nodes with inconsistent data:
[]"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if
node capacity >= allocated resources on the node","DiagnosisMessage":"Nodes
with inconsistent data: []"},{"Name":"Reservation
check","Succeeded":true,"Description":"Check the reservation nr compared to the
number of nodes","DiagnosisMessage":"Reservation/node nr ratio:
[0.000000]"},{"Name":"Orphan allocation on node
check","Succeeded":true,"Description":"Check if there are orphan allocations on
the nodes","DiagnosisMessage":"Orphan allocations: []"},{"Name":"Orphan
allocation on app check","Succeeded":true,"Description":"Check if there are
orphan allocations on the applications","DiagnosisMessage":"OrphanAllocations:
[]"}]}
2022-04-20T13:28:33.098Z INFO scheduler/health_checker.go:87
Scheduler is healthy {"health check values": [{"Name":"Scheduling
errors","Succeeded":true,"Description":"Check for scheduling error entries in
metrics","DiagnosisMessage":"There were 0 scheduling errors logged in the
metrics"},{"Name":"Failed nodes","Succeeded":true,"Description":"Check for
failed nodes entries in metrics","DiagnosisMessage":"There were 0 failed nodes
logged in the metrics"},{"Name":"Negative
resources","Succeeded":true,"Description":"Check for negative resources in the
partitions","DiagnosisMessage":"Partitions with negative resources:
[]"},{"Name":"Negative resources","Succeeded":true,"Description":"Check for
negative resources in the nodes","DiagnosisMessage":"Nodes with negative
resources: []"},{"Name":"Consistency of
data","Succeeded":true,"Description":"Check if a node's allocated resource <=
total resource of the node","DiagnosisMessage":"Nodes with inconsistent data:
[]"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if
total partition resource == sum of the node resources from the
partition","DiagnosisMessage":"Partitions with inconsistent data:
[]"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if
node total resource = allocated resource + occupied resource + available
resource","DiagnosisMessage":"Nodes with inconsistent data:
[]"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if
node capacity >= allocated resources on the node","DiagnosisMessage":"Nodes
with inconsistent data: []"},{"Name":"Reservation
check","Succeeded":true,"Description":"Check the reservation nr compared to the
number of nodes","DiagnosisMessage":"Reservation/node nr ratio:
[0.000000]"},{"Name":"Orphan allocation on node
check","Succeeded":true,"Description":"Check if there are orphan allocations on
the nodes","DiagnosisMessage":"Orphan allocations: []"},{"Name":"Orphan
allocation on app check","Succeeded":true,"Description":"Check if there are
orphan allocations on the applications","DiagnosisMessage":"OrphanAllocations:
[]"}]}
{noformat}
I don't think we need that much output every 30 seconds. In fact, if the
scheduler is healthy, we don't need to log anything at all, except perhaps a
short message at DEBUG level.
If the health check fails, we might log it, but even then the full output shown
above looks unnecessary.
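
A minimal sketch of the proposed behaviour, for illustration only. It assumes a
HealthCheckInfo struct with the four fields visible in the log output above;
the function name healthLogLine and the exact message wording are hypothetical
and not taken from the YuniKorn code base. The level is returned as a plain
string so the example needs only the standard library; the real scheduler
would pass the result to its own logger instead.

```go
package main

import (
	"fmt"
	"strings"
)

// HealthCheckInfo mirrors the fields seen in the logged JSON above.
// The real type in the scheduler may differ; this is an assumption.
type HealthCheckInfo struct {
	Name             string
	Succeeded        bool
	Description      string
	DiagnosisMessage string
}

// healthLogLine (hypothetical helper) decides what to log for one health
// check run. A healthy run yields a short DEBUG message. A failed run yields
// a WARN message that lists only the checks that did not succeed.
func healthLogLine(checks []HealthCheckInfo) (level string, msg string) {
	var failed []string
	for _, c := range checks {
		if !c.Succeeded {
			failed = append(failed, c.Name+": "+c.DiagnosisMessage)
		}
	}
	if len(failed) == 0 {
		return "DEBUG", "Scheduler is healthy"
	}
	return "WARN", "Scheduler is not healthy: " + strings.Join(failed, "; ")
}

func main() {
	checks := []HealthCheckInfo{
		{Name: "Scheduling errors", Succeeded: true,
			DiagnosisMessage: "There were 0 scheduling errors logged in the metrics"},
		{Name: "Failed nodes", Succeeded: false,
			DiagnosisMessage: "There were 2 failed nodes logged in the metrics"},
	}
	level, msg := healthLogLine(checks)
	fmt.Printf("%s %s\n", level, msg)
}
```

With this approach a healthy scheduler produces nothing at the default INFO
level, and a failed run reports only the failing checks rather than all eleven.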