[GitHub] [incubator-yunikorn-k8shim] anuraagnalluri commented on a change in pull request #369: [YUNIKORN-1040] add e2e test that re-starts the scheduler pod

GitBox Tue, 15 Feb 2022 10:53:04 -0800


anuraagnalluri commented on a change in pull request #369:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/369#discussion_r806327376




##########
File path: test/e2e/basic_scheduling/basic_scheduling_test.go
##########
@@ -70,6 +70,16 @@ var _ = ginkgo.Describe("", func() {
                Ω(err3).NotTo(HaveOccurred())
                Ω(d).NotTo(BeNil())
 
+               ginkgo.By("Restart scheduler pod")
+               _, err4 := kClient.ScaleDeployment(configmanager.YKScheduler, 0, configmanager.YuniKornTestConfig.YkNamespace)
+               gomega.Ω(err4).NotTo(gomega.HaveOccurred())
+               err5 := kClient.WaitForPodBySelectorTerminated(configmanager.YuniKornTestConfig.YkNamespace, fmt.Sprintf("component=%s", configmanager.YKScheduler), 60)

Review comment:
       I'm not sure why this timeout has to be 60. Waiting for the pods to be 
fully terminated takes around 10-12 seconds. (Here "terminated" does not mean 
the "Terminating" state in the k8s pod lifecycle; it means that an existence 
check for pods with the `selector` labels in the given `namespace` comes back 
empty.) Yet even a timeout of 30 leads to a `timeout exceeded` error. I wasn't 
sure how to debug this or how to estimate the value properly.

##########
File path: test/e2e/basic_scheduling/basic_scheduling_test.go
##########
@@ -70,6 +70,16 @@ var _ = ginkgo.Describe("", func() {
                Ω(err3).NotTo(HaveOccurred())
                Ω(d).NotTo(BeNil())
 
+               ginkgo.By("Restart scheduler pod")
+               _, err4 := kClient.ScaleDeployment(configmanager.YKScheduler, 0, configmanager.YuniKornTestConfig.YkNamespace)

Review comment:
       Note that k8s has no API call to bounce (restart) a pod, so the test 
simply scales the replicas in the scheduler deployment to 0 and then back to 1.

##########
File path: test/e2e/framework/helpers/k8s/k8s_utils.go
##########
@@ -295,14 +308,19 @@ func (k *KubeCtl) ListPods(namespace string, selector string) (*v1.PodList, erro
 }
 
 // Wait up to timeout seconds for all pods in 'namespace' with given 'selector' to enter running state.
-// Returns an error if no pods are found or not all discovered pods enter running state.
-func (k *KubeCtl) WaitForPodBySelectorRunning(namespace string, selector string, timeout int) error {
+// Returns an error if no pods are found when 'wait' is false or not all discovered pods enter running state within the 'timeout' duration.
+// If 'wait' is true, error will not be returned if no pods are found. Pods will be continually listed until there is a non-empty list
+// to iterate over.
+func (k *KubeCtl) WaitForPodBySelectorRunning(namespace string, selector string, timeout int, wait bool) error {

Review comment:
       We add this `wait` parameter because the other invocations of 
`WaitForPodBySelectorRunning` come directly after calls to `CreatePod`, so the 
pod object already exists and the API server can return it at the time of the 
call. In the newly added code above that restarts the scheduler pod, there is 
a delay between scaling the deployment back to 1 and the new pod actually 
being created, so the object is not yet available from the API server. Without 
the flag, `WaitForPodBySelectorRunning` errors immediately because no pods 
match the given `selector`, which is why the behavior has to be switchable 
with a flag.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

