Yu-Lin Chen created YUNIKORN-1998:
-------------------------------------
Summary: Stale AdmissionControllerConf was used in e2e test
Key: YUNIKORN-1998
URL: https://issues.apache.org/jira/browse/YUNIKORN-1998
Project: Apache YuniKorn
Issue Type: Bug
Components: test - e2e
Reporter: Yu-Lin Chen
In e2e test for "user_group_limit", we updated Yunikorn ConfigMap before submit
sleep pods.
[https://github.com/apache/yunikorn-k8shim/blob/master/test/e2e/user_group_limit/user_group_limit_test.go#L67-L70]
However, the AdmissionControllerConf in Admission controller doesn't
immediately reflect the changes. The time gap leads following error in CI flow:
[https://github.com/apache/yunikorn-k8shim/actions/runs/6282144385/job/17062987755?pr=677#step:5:2346]
We need to find a way to ensure AdmissionControllerConf has been updated before
e2e test submits a new pods.
This issue can be reproduced if we introduce 1 second when the admission
controller updates the AdmissionControllerConf. Here is the example code.
{code:java}
func (h *configMapUpdateHandler) OnUpdate(_, newObj interface{}) {
cm := utils.Convert2ConfigMap(newObj)
// sleep 1 to deply AdmissionControllerConf update
time.Sleep(1 * time.Second)
if idx, ok := h.configMapIndex(cm); ok {
h.conf.configUpdated(idx, cm)
}
} {code}
[https://github.com/apache/yunikorn-k8shim/blob/master/pkg/admission/conf/am_conf.go#L237-L238]
Below are the before/after AM logs when we added 1 sec delay:
Success without delay 1 sec:
* (AM configMapUpdateHandler) 05:36:39.184Z.
AdmissionControllerConf trying to upgrade config
* (AM configMapUpdateHandler) 05:36:39.185Z .
AdmissionControllerConf config upgraded
* (AM Webhook) 05:36:39.218Z. AM
receive AdmissionReview request
* (AM Webhook) 05:36:39.221Z. AM check
Pods with new config, E2E Test Passed
Failed with delay 1 sec:
* (AM configMapUpdateHandler) 08:19:31.025Z.
AdmissionControllerConf trying to upgrade config
* (AM Webhook) 08:19:31.067Z. AM
receive AdmissionReview request
* (AM Webhook) 08:19:31.069Z
{color:#FF0000}*AM check Pods with stale config*{color}, {color:#FF0000}*E2E
Test failed* {color}
* (AM configMapUpdateHandler) 08:19:32.026Z.
AdmissionControllerConf config upgraded
In my kind cluster(v1.24.15), there is only 0.033 sec gap. It's possible that
the admission controller in CI flow runs those steps in different order.
The possible Solution:
1. In the e2e test, sleep for 1 seconds between updating the configmap and
submitting new sleep pods. (It's quick fix, I assumed the time gap is always
less than 1 sec )
2. In the e2e test, check current AdmissionControllerConf's value before submit
a new pod. (How do client dump current AdmissionControllerConf? Need to seek
for more advice.)
Since this issue only impact e2e test for now, we can go with solution #1 as a
quick fix. But it'll be better if we allow client to check whether
AdmissionControllerConf is up-to-date.
Please kindly let me know if I have any misunderstandings.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]