cf-natali opened a new pull request #397:
URL: https://github.com/apache/mesos/pull/397
Pending `ticks` are used by `scheduleTick` to decide whether to schedule
an event loop tick when a new timer is scheduled - since we only need to
schedule the event loop tick if the new timer is supposed to expire
earlier than the current earliest timer.
Unfortunately `Clock::finalize` didn't clear `ticks`, which means that
the following could happen:
- schedule a timer T0 for expiration at time t0
- call `process::reinitalize`, which calls `Clock::finalize` but doesn't
clear `ticks`
- schedule a new timer T1 for expiration at time t1 > t0: since
`scheduleTick` would see that there was already the earlier pending
tick for T0 in `ticks` with t0 < t1, it wouldn't actually schedule a
tick of the event loop
Therefore new timers would never fire again.
See
https://github.com/apache/mesos/blob/1b445c392a25e6c556acbf2bd8ed5214250ddace/3rdparty/libprocess/src/clock.cpp#L127
for the check causing subsequent timers to never fire.
This caused e.g.
`DockerContainerizerIPv6Test.ROOT_DOCKER_LaunchIPv6HostNetwork` test to
hang since it called `process::reinitialized` while having some active
timers - e.g. the reaper periodic timer.
Also added a test specifically for this bug.
Original test hanging (log abridged for clarity):
```
root@thinkpad:/home/cf/src/mesos/build# GLOG_v=9 ./bin/mesos-tests.sh
--verbose
gtest_filter=DockerContainerizerIPv6Test.ROOT_DOCKER_LaunchIPv6HostNetwork
[...]
I0626 19:12:49.205178 514538 process.cpp:3088] Cleaning up
__reaper__(1)@192.168.1.3:45143
I0626 19:12:49.205469 514545 process.cpp:935] Stopped the socket accept loop
[...]
W0626 19:12:49.268715 514523 docker_containerizer_tests.cpp:205] Pulling
alpine and mesosphere/inky. This might take a while...
I0626 19:12:49.279342 514523 process.cpp:2918] Spawned process
__latch__(5)@192.168.1.3:42489
I0626 19:12:49.279457 514581 process.cpp:2928] Resuming
__latch__(5)@192.168.1.3:42489 at 2021-06-26 18:12:49.279472128+00:00
I0626 19:12:49.279515 514581 process.cpp:2928] Resuming
__waiter__(5)@192.168.1.3:42489 at 2021-06-26 18:12:49.279522048+00:00
I0626 19:12:49.279533 514581 process.cpp:4017] Running waiter process for
__latch__(5)@192.168.1.3:42489
I0626 19:12:49.279700 514581 clock.cpp:279] Created a timer for
__waiter__(5)@192.168.1.3:42489 in 10mins in the future (2021-06-26
18:22:49.279692800+00:00)
I0626 19:12:49.267994 514587 clock.cpp:279] Created a timer for
__reaper__(2)@192.168.1.3:42489 in 100ms in the future (2021-06-26
18:12:49.367970048+00:00)
I0626 19:12:49.280148 514523 process.cpp:2918] Spawned process
__waiter__(5)@192.168.1.3:42489
I0626 19:12:49.267753 514580 process.cpp:2928] Resuming
__collect__(1)@192.168.1.3:42489 at 2021-06-26 18:12:49.267812864+00:00
^C
root@thinkpad:/home/cf/src/mesos/build#
```
We can see that timers scheduled never fire.
And new test:
```
root@thinkpad:/home/cf/src/mesos/build#
./3rdparty/libprocess/libprocess-tests --gtest_filter=*Reinit*
Note: Google Test filter = *Reinit*-
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from ProcessTest
[ RUN ] ProcessTest.TimerAfterReinitialize
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0626 18:34:56.618572 500810 process.cpp:935] Stopped the socket accept loop
^C
root@thinkpad:/home/cf/src/mesos/build#
```
@asekretenko @qianzhangxa @surahman
And also maybe @bmahler - I know you don't work on Mesos anymore but I think
you might be quite familiar with this code and it's a small change so hopefully
you can take a quick look?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]