craigcondit commented on pull request #369:
URL: https://github.com/apache/yunikorn-k8shim/pull/369#issuecomment-1086266547


   > The issue is that there is a sleep pod deployed in development namespace 
[here](https://github.com/apache/yunikorn-k8shim/blob/82302baf7cae11b397e597bd28ba695048169dfc/test/e2e/recovery_and_restart/recovery_and_restart_test.go#L75),
 so the test actually expects an allocation list of length 1 
[here](https://github.com/apache/yunikorn-k8shim/blob/82302baf7cae11b397e597bd28ba695048169dfc/test/e2e/recovery_and_restart/recovery_and_restart_test.go#L117).
   > 
   > Most of the time, this works and we can see the corresponding allocation. 
Sometimes, but only on plugin mode, this allocation list that's retrieved is of 
length 0 despite the sleep pod being present/running in the dev namespace. This 
causes an error when trying to index the first element of the list with `[0]`, 
but it's unclear why the corresponding allocation isn't showing in the 
`appsInfo` var which is populated with a `GetAppInfo` call 
[here](https://github.com/apache/yunikorn-k8shim/blob/82302baf7cae11b397e597bd28ba695048169dfc/test/e2e/recovery_and_restart/recovery_and_restart_test.go#L103).
   > 
   > It's worth noting that the only difference between this and 
[basic_scheduling_test 
behavior](https://github.com/apache/yunikorn-k8shim/blob/82302baf7cae11b397e597bd28ba695048169dfc/test/e2e/basic_scheduling/basic_scheduling_test.go#L105-L121)
 is that we [bounce the scheduler 
pod](https://github.com/apache/yunikorn-k8shim/blob/82302baf7cae11b397e597bd28ba695048169dfc/test/e2e/recovery_and_restart/recovery_and_restart_test.go#L85)
 prior to making these checks. We want to ensure that restarting the scheduler 
won't drop allocations or affect any normal functionality. But this does not 
seem to be the case 100% of the time specifically in plugin mode.
   
   There's a few things I can see that are issues with the test right away. For 
one thing, you've structured this as multiple integration tests, when in fact 
there is really only one. All of the restart / re-forward behavior should go 
into the setup method, and not be a separate IT test (Ginkgo doesn't guarantee 
test ordering, so the current structure is brittle, even if it appears to 
execute correctly).
   
   More importantly, there's nothing I can see here that waits for scheduler 
recovery to complete. That's not a simple thing to do, but you might try 
scheduling a second pod (after the restart) and wait for it to be running. Once 
that happens, it should be safe to query for the original pod.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to