> On March 27, 2019, 5:10 p.m., Chun-Hung Hsiao wrote:
> > src/tests/storage_local_resource_provider_tests.cpp
> > Lines 4926 (patched)
> > <https://reviews.apache.org/r/70184/diff/1/?file=2130933#file2130933line4926>
> >
> >     I'm not sure if this is the most appropriate fix before understanding 
> > why the offer was filtered. Can you elaborate more on why this is needed?

Okay I think I now understand what happened here.

It was not because that the offer filter prevented the framework from getting 
the offer. No filter is set to refuse the newly created volume.

What actually happened was that, after the master received 
`OPERATION_FINISHED`, it would call `allocator->updateAllocation(...)` and 
`allocator->recoverResources(...)`:
https://github.com/apache/mesos/blob/cb4c9660eecae59bc2acbeb37ab19214f7b0e19b/src/master/master.cpp#L11962-L11972
Since these steps were done asynchronously, they raced with the clock advanced 
that triggered the batch allocation after L4926 in this test.
So in the failed run, the batch allocation was done before the resource (i.e., 
the created volume) was recovered, so the offer allocated to the framework did 
not have any "new" resource and thus filtered.
This can be reproduced by adding an `os::sleep(Milliseconds(10))` right before 
`allocator->updateAllocation(...)` above.

Adding a `REVIVE` call does the trick, because the revive call, which was 
processed sequentially *after* `OPERATION_FINISHED`, would asynchronuosly queue 
up an event-based allocation after the above two allocator calls.
So even if the race happens, it ensures that another allocation would be 
triggered after the created volume is recovered.
However it seems to me that a more proper fix is the following:
```
   EXPECT_EQ(OPERATION_FINISHED, updateOperationStatus->status().state());
-  // Advance the clock to trigger a batch allocation.
+  // Settle the clock to recover the created disk then advance the clock to
+  // trigger a batch allocation.
+  Clock::settle();
   Clock::advance(masterFlags.allocation_interval);

   AWAIT_READY(offers);
```


- Chun-Hung


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70184/#review214121
-----------------------------------------------------------


On March 11, 2019, 3:39 p.m., Jan Schlicht wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70184/
> -----------------------------------------------------------
> 
> (Updated March 11, 2019, 3:39 p.m.)
> 
> 
> Review request for mesos and Chun-Hung Hsiao.
> 
> 
> Bugs: MESOS-9594
>     https://issues.apache.org/jira/browse/MESOS-9594
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Under some circumstances, offers would be filtered, resulting in the
> test being stuck while waiting for offers. This has been resolved by
> removing all offer filters before waiting for offers.
> 
> 
> Diffs
> -----
> 
>   src/tests/storage_local_resource_provider_tests.cpp 
> 7945384867f26fa15dc734a235ae509d5d6d350f 
> 
> 
> Diff: https://reviews.apache.org/r/70184/diff/1/
> 
> 
> Testing
> -------
> 
> make check
> 
> I could not reproduce the flaky behavior of this test case, hence only assume 
> that this patch resolves the flakiness.
> 
> 
> Thanks,
> 
> Jan Schlicht
> 
>

Reply via email to