[ 
https://issues.apache.org/jira/browse/YUNIKORN-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063052#comment-17063052
 ] 

Wilfred Spiegelenburg edited comment on YUNIKORN-30 at 3/20/20, 6:12 AM:
-------------------------------------------------------------------------

Further details:

I believe I have found the root cause behind some of the failures. I described 
one of the cases above, but other failures are of a different type:
 * Incorrect partition name: we seem to call normalise on the partition name 
twice during our testing.
{code:java}
2020-03-19T21:31:53.9142459Z 2020-03-19T21:31:41.042Z   INFO    cache/cluster_info.go:618       Failed to find partition for allocation proposal        {"partitionName": "[rm:123][rm:123]default"}
{code}
Normalising twice should never be a problem. That it is one shows a bug in the 
code, which should be able to handle this. The fix is for the normalisation 
code to check whether the name is already normalised.
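
A minimal sketch of such an idempotent normalisation, assuming the normalised form is the partition name prefixed with the RM ID in square brackets, as the log line above suggests. The function name {{normalizePartitionName}} and its signature are hypothetical and are not taken from the YuniKorn code base:

```go
package main

import (
	"fmt"
	"strings"
)

// normalizePartitionName is a hypothetical helper (not the YuniKorn function).
// It returns "[rmID]partitionName", and leaves the name untouched if it
// already carries that exact prefix, so calling it twice is harmless.
func normalizePartitionName(partitionName, rmID string) string {
	prefix := "[" + rmID + "]"
	if strings.HasPrefix(partitionName, prefix) {
		// Already normalised for this RM: do not add the prefix again.
		return partitionName
	}
	return prefix + partitionName
}

func main() {
	once := normalizePartitionName("default", "rm:123")
	twice := normalizePartitionName(once, "rm:123")
	fmt.Println(once)  // [rm:123]default
	fmt.Println(twice) // [rm:123]default
}
```

The prefix check only guards against a repeated call with the same RM ID. A partition name that legitimately starts with the same bracketed text would be treated as already normalised, which is an accepted limitation of this sketch.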

 * Event handling: a generic underlying issue. During some local testing I 
noticed that we do not properly wait for the event handling to process all the 
events that are generated. In the case observed, allocation releases were still 
being processed while the end state check was performed. These issues can be 
fixed by adding a proper wait in the test code.
 * However, in certain failures we see nothing. This could point to a problem 
with goroutines not being scheduled. The logs for these cases show a blank 
period of about 1 second (the maximum time we wait for things) between the 
normal processing and the wait timing out. I cannot reproduce those yet.

Working on a PR to fix at least the majority of what I have found.


> flaky tests cause build failures on PRs
> ---------------------------------------
>
>                 Key: YUNIKORN-30
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-30
>             Project: Apache YuniKorn
>          Issue Type: Test
>          Components: test - smoke
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Wilfred Spiegelenburg
>            Priority: Blocker
>         Attachments: TestBasicScheduler_github_fail.log
>
>
> Smoke tests have been failing on PR-triggered builds.
> Failures are inconsistent and linked to multiple test cases. Failures in the 
> same test can even occur at different lines of code in different runs, 
> without any code changes:
> {code}
> 2020-03-11T04:39:40.8332236Z --- FAIL: TestSchedulerRecovery (3.07s)
> 2020-03-11T04:39:40.8340886Z ##[error]    mock_rm_callback.go:175: Failed to 
> wait for allocations, expected 4, actual 3, called from: 
> TestSchedulerRecovery in scheduler_recovery_test.go:213
> {code}
> {code}
> 2020-03-11T04:39:40.9102758Z --- FAIL: TestBasicScheduler (1.11s)
> 2020-03-11T04:39:40.9103549Z ##[error]    mock_rm_callback.go:175: Failed to 
> wait for allocations, expected 4, actual 3, called from: TestBasicScheduler 
> in scheduler_smoke_test.go:341
> {code}
> {code}
> 2020-03-06T07:17:50.4567697Z --- FAIL: TestReservationForTwoQueues (3.10s)
> 2020-03-06T07:17:50.4574239Z ##[error]    scheduler_reservation_test.go:276: 
> partition reservations are missing
> {code}
> {code}
> 2020-03-06T08:08:21.8912443Z --- FAIL: TestRemoveReservedNode (1.05s)
> 2020-03-06T08:08:21.8917559Z ##[error]    scheduler_utils.go:79: Failed to 
> wait for pending resource, expected 80, actual 60, called from: 
> TestRemoveReservedNode in scheduler_reservation_test.go:356
> {code}
> {code}
> 2020-03-04T10:42:16.5788872Z --- FAIL: TestRemoveReservedNode (0.07s)
> 2020-03-04T10:42:16.5789359Z ##[error]    scheduler_reservation_test.go:357: 
> assertion failed: 2 (int) != 1 (int): reservations missing from app
> {code}


