[ https://issues.apache.org/jira/browse/MESOS-6907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexander Rojas updated MESOS-6907:
-----------------------------------
Description:
There appears to be a race condition between the moment an instance of
{{Future<T>}} goes out of scope and the moment its underlying data is actually
deleted, if {{Future<T>::after(Duration, lambda::function<Future<T>(const
Future<T>&)>)}} has been called on it.
The issue is more likely to occur on a machine that is under load or not very
powerful. The easiest way to reproduce it is to run:
{code}
$ stress -c 4 -t 2600 -d 2 -i 2 &
$ ./libprocess-tests --gtest_filter="FutureTest.After3" --gtest_repeat=-1 \
    --gtest_break_on_failure
{code}
An exploratory fix for the issue is to change the test to:
{code}
TEST(FutureTest, After3)
{
  Future<Nothing> future;
  process::WeakFuture<Nothing> weak_future(future);

  EXPECT_SOME(weak_future.get());

  {
    Clock::pause();

    // The original future disappears here. After this call the
    // original future goes out of scope and should not be reachable
    // anymore.
    future = future
      .after(Milliseconds(1), [](Future<Nothing> f) {
        f.discard();
        return Nothing();
      });

    Clock::advance(Seconds(2));
    Clock::settle();

    AWAIT_READY(future);
  }

  if (weak_future.get().isSome()) {
    os::sleep(Seconds(1));
  }

  EXPECT_NONE(weak_future.get());
  EXPECT_FALSE(future.hasDiscard());
}
{code}
Interestingly, both extra snippets are needed to prevent the issue; either one
alone is not enough.
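The suspected race can be pictured with a minimal sketch in standard C++. It uses {{std::shared_ptr}} and {{std::weak_ptr}} as stand-ins for {{Future<T>}} and {{WeakFuture<T>}}, and it assumes (this is not confirmed by the libprocess source) that {{after()}} lets a timer running on another thread hold a reference to the future's shared data for a short while. The names {{Observation}} and {{observe_lifetime}} are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <future>
#include <memory>
#include <thread>

// What the weak reference reports at two points in time.
struct Observation
{
  bool alive_while_held;    // Background thread still holds a strong reference.
  bool alive_after_release; // Background thread has dropped its reference.
};

// Illustration only: std::shared_ptr / std::weak_ptr stand in for
// Future<T> / WeakFuture<T>; the thread stands in for a timer callback
// (an assumption about how the race might arise, not libprocess code).
Observation observe_lifetime()
{
  std::shared_ptr<int> data = std::make_shared<int>(42);
  std::weak_ptr<int> weak = data;

  std::promise<void> release;
  std::future<void> released = release.get_future();

  // The thread keeps its own strong reference until it is told to let go.
  std::thread timer([copy = data, &released]() mutable {
    released.wait();
    copy.reset();
  });

  // The owner's handle "goes out of scope" here.
  data.reset();

  Observation result;

  // The data is still reachable through the weak reference, because the
  // background thread has not dropped its copy yet.
  result.alive_while_held = !weak.expired();

  release.set_value();
  timer.join();

  // Only once the background thread is done is the data really gone.
  result.alive_after_release = !weak.expired();

  return result;
}
```

If the real cause is of this kind, it would be consistent with the observation that giving the other thread time to finish (the {{os::sleep}} in the exploratory fix) makes {{EXPECT_NONE(weak_future.get())}} pass.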
was:
After trying out the latest patch for MESOS-6484, we found that its
modifications introduce flakiness in the test {{FutureTest.After3}}. Depending
on the machine and its load, the test fails between once every 10,000 runs and
once every 500,000 runs, most likely because of a race condition in the code.
To reproduce run:
{code}
${MESOS_BUILD_DIR}/3rdparty/libprocess/libprocess-tests \
    --gtest_filter="*.After3" --gtest_repeat=-1 --gtest_break_on_failure
{code}
> FutureTest.After3 is flaky
> --------------------------
>
> Key: MESOS-6907
> URL: https://issues.apache.org/jira/browse/MESOS-6907
> Project: Mesos
> Issue Type: Bug
> Components: libprocess
> Reporter: Alexander Rojas
>