longfei created MESOS-9951:
------------------------------
Summary: A likely STW problem in master's GC routine
Key: MESOS-9951
URL: https://issues.apache.org/jira/browse/MESOS-9951
Project: Mesos
Issue Type: Bug
Reporter: longfei
Attachments: image-2019-08-22-14-00-16-298.png
I'm running a Mesos 1.7.3 master, which recently appeared to stall for roughly half a minute (about 38 seconds, according to the log below).
{code:java}
// I0820 20:53:56.705075 4185864 registrar.cpp:487] Applied 1 operations in
1.163968ms; attempting to update the registry
I0820 20:53:56.705541 4185861 coordinator.cpp:348] Coordinator attempting to
write APPEND action at position 353
I0820 20:53:56.705739 4185875 replica.cpp:541] Replica received write request
for position 353 from __req_res__(568)@10.10.23.74:5050
I0820 20:53:56.721997 4185859 master.cpp:8753] Executor
'mt:l00000000004115106217:1' of framework
a878e862-349c-4206-bfb8-3048c841e8ec-0002 on agent
bd5550a6-4089-482d-aa96-3389bae5b0de-S179 at slave(1)@10.153.38.24:5051
(10.153.38.24): exited with status 0
I0820 20:53:56.722085 4185859 master.cpp:11215] Removing executor
'mt:l00000000004115106217:1' with resources [] of framework
a878e862-349c-4206-bfb8-3048c841e8ec-0002 on agent
bd5550a6-4089-482d-aa96-3389bae5b0de-S179 at slave(1)@10.153.38.24:5051
(10.153.38.24)
I0820 20:53:56.742550 4185877 replica.cpp:695] Replica received learned notice
for position 353 from log-network(1)@10.10.23.74:5050
I0820 20:53:56.784256 4185881 registrar.cpp:544] Successfully updated the
registry in 79.105792ms
I0820 20:53:56.784489 4185857 coordinator.cpp:348] Coordinator attempting to
write TRUNCATE action at position 354
I0820 20:53:56.784641 4185890 replica.cpp:541] Replica received write request
for position 354 from __req_res__(571)@10.10.23.74:5050
I0820 20:53:56.825901 4185890 replica.cpp:695] Replica received learned notice
for position 354 from log-network(1)@10.10.23.74:5050
I0820 20:54:34.798512 4185864 master.cpp:1978] Garbage collected 1 unreachable
and 0 gone agents from the registry
I0820 20:54:34.798610 4185864 master.cpp:8510] Status update TASK_FINISHED
(Status UUID: 6304aa62-2854-4d46-ad09-ffbf3347f24b) for task
mt:l00000000004115107127:1 of framework
a878e862-349c-4206-bfb8-3048c841e8ec-0002 from agent
bd5550a6-4089-482d-aa96-3389bae5b0de-S138 at slave(1)@10.17.44.133:5051
(10.17.44.133)
{code}
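Note that no log lines were produced between 20:53:56.825901 and 20:54:34.798512, and the first line after the gap is the registry garbage collection message.
atop shows that one core (used by the master) was fully utilized during this stop-the-world (STW) period.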
!image-2019-08-22-14-00-16-298.png!
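To make it easier to spot stalls like this in a longer master log, here is a minimal Python sketch that scans glog-style lines and reports any silence longer than a threshold. It is only an illustration: the helper names (parse_timestamps, find_gaps), the 5-second threshold and the year 2019 are my own assumptions (glog lines carry no year), and the sample is just the two lines around the gap quoted above.

```python
import re
from datetime import datetime, timedelta

# glog prefix: severity letter, MMDD, then HH:MM:SS.microseconds.
# An optional leading "//" is tolerated because the first pasted line has one.
GLOG_PREFIX = re.compile(r"^(?://\s*)?[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6}) ")


def parse_timestamps(lines, year=2019):
    """Return the timestamp of every line that starts a glog record.

    The year is an assumption supplied by the caller; wrapped continuation
    lines (which have no glog prefix) are skipped.
    """
    stamps = []
    for line in lines:
        match = GLOG_PREFIX.match(line)
        if match is None:
            continue
        text = f"{year}{match.group(1)} {match.group(2)}"
        stamps.append(datetime.strptime(text, "%Y%m%d %H:%M:%S.%f"))
    return stamps


def find_gaps(stamps, threshold=timedelta(seconds=5)):
    """Return (previous, current, delta) for consecutive records whose
    timestamps are at least `threshold` apart."""
    gaps = []
    for prev, cur in zip(stamps, stamps[1:]):
        if cur - prev >= threshold:
            gaps.append((prev, cur, cur - prev))
    return gaps


# The two records on either side of the silent period in the log above.
SAMPLE = """\
I0820 20:53:56.825901 4185890 replica.cpp:695] Replica received learned notice
for position 354 from log-network(1)@10.10.23.74:5050
I0820 20:54:34.798512 4185864 master.cpp:1978] Garbage collected 1 unreachable
and 0 gone agents from the registry
""".splitlines()

if __name__ == "__main__":
    for prev, cur, delta in find_gaps(parse_timestamps(SAMPLE)):
        print(f"{prev.time()} -> {cur.time()}: {delta.total_seconds():.3f}s")
```

On the sample this reports a single gap of about 37.97 seconds. A gap in the log only shows that nothing was logged; it does not by itself prove the master was blocked, which is why the atop observation matters.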
--
This message was sent by Atlassian Jira
(v8.3.2#803003)