[ https://issues.apache.org/jira/browse/YUNIKORN-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063052#comment-17063052 ]
Wilfred Spiegelenburg edited comment on YUNIKORN-30 at 3/20/20, 6:12 AM:
-------------------------------------------------------------------------

Further details: I think I have found the root cause behind some of the failures. I described one of the cases above, but some of the others show a different failure type.

* Incorrect partition name: we seem to call normalise on the partition name twice during our testing.
{code:java}
2020-03-19T21:31:53.9142459Z 2020-03-19T21:31:41.042Z INFO cache/cluster_info.go:618 Failed to find partition for allocation proposal {"partitionName": "[rm:123][rm:123]default"}
{code}
Normalising twice should never be a problem: this shows a bug in the code, which should be able to handle it. The fix is for the normalisation code to check whether the name is already normalised.
* Event handling: a generic underlying issue. During some local testing I noticed that we do not properly wait for the event handling to process all the events that are generated. In the case observed, allocation releases were still being processed while the end state check was performed. Those issues can be fixed by a proper wait in the test code.
* In certain failures, however, the logs show nothing. This could point to a problem with goroutines not being scheduled. The logs for these cases show a blank period of about 1 sec (the max time we wait for things) between the normal processing and the wait timing out. I cannot reproduce those yet.

I am working on a PR to fix at least the majority of what I have found.

> flaky tests cause build failures on PRs
> ---------------------------------------
>
>                 Key: YUNIKORN-30
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-30
>             Project: Apache YuniKorn
>          Issue Type: Test
>          Components: test - smoke
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Wilfred Spiegelenburg
>            Priority: Blocker
>         Attachments: TestBasicScheduler_github_fail.log
>
>
> Smoke tests have been failing on PR triggered builds.
> Failures are inconsistent and linked to multiple test cases; failures in the same test can even happen at different lines of code in different runs without any changes:
> {code}
> 2020-03-11T04:39:40.8332236Z --- FAIL: TestSchedulerRecovery (3.07s)
> 2020-03-11T04:39:40.8340886Z ##[error] mock_rm_callback.go:175: Failed to wait for allocations, expected 4, actual 3, called from: TestSchedulerRecovery in scheduler_recovery_test.go:213
> {code}
> {code}
> 2020-03-11T04:39:40.9102758Z --- FAIL: TestBasicScheduler (1.11s)
> 2020-03-11T04:39:40.9103549Z ##[error] mock_rm_callback.go:175: Failed to wait for allocations, expected 4, actual 3, called from: TestBasicScheduler in scheduler_smoke_test.go:341
> {code}
> {code}
> 2020-03-06T07:17:50.4567697Z --- FAIL: TestReservationForTwoQueues (3.10s)
> 2020-03-06T07:17:50.4574239Z ##[error] scheduler_reservation_test.go:276: partition reservations are missing
> {code}
> {code}
> 2020-03-06T08:08:21.8912443Z --- FAIL: TestRemoveReservedNode (1.05s)
> 2020-03-06T08:08:21.8917559Z ##[error] scheduler_utils.go:79: Failed to wait for pending resource, expected 80, actual 60, called from: TestRemoveReservedNode in scheduler_reservation_test.go:356
> {code}
> {code}
> 2020-03-04T10:42:16.5788872Z --- FAIL: TestRemoveReservedNode (0.07s)
> 2020-03-04T10:42:16.5789359Z ##[error] scheduler_reservation_test.go:357: assertion failed: 2 (int) != 1 (int): reservations missing from app
> {code}