craigcondit commented on pull request #369: URL: https://github.com/apache/yunikorn-k8shim/pull/369#issuecomment-1086266547
> The issue is that there is a sleep pod deployed in development namespace [here](https://github.com/apache/yunikorn-k8shim/blob/82302baf7cae11b397e597bd28ba695048169dfc/test/e2e/recovery_and_restart/recovery_and_restart_test.go#L75), so the test actually expects an allocation list of length 1 [here](https://github.com/apache/yunikorn-k8shim/blob/82302baf7cae11b397e597bd28ba695048169dfc/test/e2e/recovery_and_restart/recovery_and_restart_test.go#L117). > > Most of the time, this works and we can see the corresponding allocation. Sometimes, but only on plugin mode, this allocation list that's retrieved is of length 0 despite the sleep pod being present/running in the dev namespace. This causes an error when trying to index the first element of the list with `[0]`, but it's unclear why the corresponding allocation isn't showing in the `appsInfo` var which is populated with a `GetAppInfo` call [here](https://github.com/apache/yunikorn-k8shim/blob/82302baf7cae11b397e597bd28ba695048169dfc/test/e2e/recovery_and_restart/recovery_and_restart_test.go#L103). > > It's worth noting that the only difference between this and [basic_scheduling_test behavior](https://github.com/apache/yunikorn-k8shim/blob/82302baf7cae11b397e597bd28ba695048169dfc/test/e2e/basic_scheduling/basic_scheduling_test.go#L105-L121) is that we [bounce the scheduler pod](https://github.com/apache/yunikorn-k8shim/blob/82302baf7cae11b397e597bd28ba695048169dfc/test/e2e/recovery_and_restart/recovery_and_restart_test.go#L85) prior to making these checks. We want to ensure that restarting the scheduler won't drop allocations or affect any normal functionality. But this does not seem to be the case 100% of the time specifically in plugin mode. There's a few things I can see that are issues with the test right away. For one thing, you've structured this as multiple integration tests, when in fact there is really only one. All of the restart / re-forward behavior should go into the setup method, and not be a separate IT test (Ginkgo doesn't guarantee test ordering, so the current structure is brittle, even if it appears to execute correctly). More importantly, there's nothing I can see here that waits for scheduler recovery to complete. That's not a simple thing to do, but you might try scheduling a second pod (after the restart) and wait for it to be running. Once that happens, it should be safe to query for the original pod. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
