[jira] [Updated] (YUNIKORN-2292) Flaky E2E Test: Orphan pods still exist after TearDownNamespace()

Yu-Lin Chen (Jira) Fri, 29 Dec 2023 01:31:07 -0800


     [ 
https://issues.apache.org/jira/browse/YUNIKORN-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Yu-Lin Chen updated YUNIKORN-2292:
----------------------------------
    Description: 
The current [TearDownNamespace() 
|https://github.com/apache/yunikorn-k8shim/blob/master/test/e2e/framework/helpers/k8s/k8s_utils.go#L393]function
 delete pods before deleting namesapce. However, there might be a Job still 
alive that can recreate the deleted Pods. The incompleted cleanup will cause 
other e2e test to fail.

For example: This failed E2E test ([PR 
#742|https://github.com/apache/yunikorn-k8shim/actions/runs/7078108545/job/19271866521?pr=742#step:6:3952])
 was caused by an orphan pod, which was created by a Job submitted in 
[recovery_and_restart_test.go#L339|https://github.com/apache/yunikorn-k8shim/blob/master/test/e2e/recovery_and_restart/recovery_and_restart_test.go#L339]
 .

{*}Time series{*}:
 * 2023-12-04T04:39:38.4083946Z Submit gang job. (recovery_and_restart)
 * 2023-12-04T04:39:46.8767068Z {color:#de350b}Tear down namespace{color}: 
devjyy8x (Delete pod → Delete Namesapce) 
 * 2023-12-04T04:39:47.0512900Z Tear down namespace completed 
({color:#de350b}One orphan remains{color})
 * 2023-12-04T04:39:50.5110355Z Restart the scheduler pod. (resource_fairness)
 * 2023-12-04T04:39:52.178Z ERROR Failed to add application to partition 
(placement rejected)
→ {color:#de350b}The orphan pod led to node removal in scheduler after 
recovery. {color}
 * 2023-12-04T04:41:35.9084137Z test faield. (simple_preemptor e2e )
 * 2023-12-04T04:41:36.3406477Z Cluster dump output shows 1 orphan pod with age 
110s ({color:#de350b}The pod was recreated when running 
TearDownNamespace(){color} )

{*}Solution{*}:
In addition to calling deletePod, we should also check and delete Jobs in 
[TearDownNamespace() 
|https://github.com/apache/yunikorn-k8shim/blob/master/test/e2e/framework/helpers/k8s/k8s_utils.go#L393]

  was:
The current [TearDownNamespace() 
|https://github.com/apache/yunikorn-k8shim/blob/master/test/e2e/framework/helpers/k8s/k8s_utils.go#L393]function
 delete pods before deleting namesapce. However, there might be a Job still 
alive that can recreate the deleted Pods. The incompleted cleanup will cause 
other e2e test to fail.

For example: This failed E2E test ([PR 
#742|https://github.com/apache/yunikorn-k8shim/actions/runs/7078108545/job/19271866521?pr=742#step:6:3952])
 was caused by an orphan pod, which was created by a Job submitted in 
[recovery_and_restart_test.go#L339|https://github.com/apache/yunikorn-k8shim/blob/master/test/e2e/recovery_and_restart/recovery_and_restart_test.go#L339]
 .

{*}Time series{*}:
 * 2023-12-04T04:39:38.4083946Z Submit gang job. (recovery_and_restart)
 * 2023-12-04T04:39:46.8767068Z {color:#de350b}Tear down namespace{color}: 
devjyy8x (Delete pod → Delete Namesapce) 
 * 2023-12-04T04:39:47.0512900Z Tear down namespace completed (One orphan 
remains)
 * 2023-12-04T04:39:50.5110355Z Restart the scheduler pod. (resource_fairness)
 * 2023-12-04T04:39:52.178Z ERROR Failed to add application to partition 
(placement rejected)
→ {color:#de350b}The orphan pod led to node removal in scheduler after 
recovery. {color}
 * 2023-12-04T04:41:35.9084137Z test faield. (simple_preemptor e2e )
 * 2023-12-04T04:41:36.3406477Z Cluster dump output shows 1 orphan pod with age 
110s ({color:#de350b}The pod was recreated when running 
TearDownNamespace(){color} )

{*}Solution{*}:
In addition to calling deletePod, we should also check and delete Jobs in 
[TearDownNamespace() 
|https://github.com/apache/yunikorn-k8shim/blob/master/test/e2e/framework/helpers/k8s/k8s_utils.go#L393]


> Flaky E2E Test: Orphan pods still exist after TearDownNamespace()
> -----------------------------------------------------------------
>
>                 Key: YUNIKORN-2292
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2292
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: test - e2e
>            Reporter: Yu-Lin Chen
>            Assignee: Yu-Lin Chen
>            Priority: Major
>
> The current [TearDownNamespace() 
> |https://github.com/apache/yunikorn-k8shim/blob/master/test/e2e/framework/helpers/k8s/k8s_utils.go#L393]function
>  delete pods before deleting namesapce. However, there might be a Job still 
> alive that can recreate the deleted Pods. The incompleted cleanup will cause 
> other e2e test to fail.
> For example: This failed E2E test ([PR 
> #742|https://github.com/apache/yunikorn-k8shim/actions/runs/7078108545/job/19271866521?pr=742#step:6:3952])
>  was caused by an orphan pod, which was created by a Job submitted in 
> [recovery_and_restart_test.go#L339|https://github.com/apache/yunikorn-k8shim/blob/master/test/e2e/recovery_and_restart/recovery_and_restart_test.go#L339]
>  .
> {*}Time series{*}:
>  * 2023-12-04T04:39:38.4083946Z Submit gang job. (recovery_and_restart)
>  * 2023-12-04T04:39:46.8767068Z {color:#de350b}Tear down namespace{color}: 
> devjyy8x (Delete pod → Delete Namesapce) 
>  * 2023-12-04T04:39:47.0512900Z Tear down namespace completed 
> ({color:#de350b}One orphan remains{color})
>  * 2023-12-04T04:39:50.5110355Z Restart the scheduler pod. (resource_fairness)
>  * 2023-12-04T04:39:52.178Z ERROR Failed to add application to partition 
> (placement rejected)
> → {color:#de350b}The orphan pod led to node removal in scheduler after 
> recovery. {color}
>  * 2023-12-04T04:41:35.9084137Z test faield. (simple_preemptor e2e )
>  * 2023-12-04T04:41:36.3406477Z Cluster dump output shows 1 orphan pod with 
> age 110s ({color:#de350b}The pod was recreated when running 
> TearDownNamespace(){color} )
> {*}Solution{*}:
> In addition to calling deletePod, we should also check and delete Jobs in 
> [TearDownNamespace() 
> |https://github.com/apache/yunikorn-k8shim/blob/master/test/e2e/framework/helpers/k8s/k8s_utils.go#L393]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (YUNIKORN-2292) Flaky E2E Test: Orphan pods still exist after TearDownNamespace()

Reply via email to