Benjamin Bannier created MESOS-7008:
---------------------------------------
Summary: Incomplete recovery of roles leading to fatal CHECK
failure
Key: MESOS-7008
URL: https://issues.apache.org/jira/browse/MESOS-7008
Project: Mesos
Issue Type: Bug
Components: master
Environment: OS X, SSL build
Reporter: Benjamin Bannier
When a quota is set and the master is then restarted, removing the quota
reliably leads to a fatal {{CHECK}} failure for me.
Start a master:
{code}
$ mesos-master --work_dir=work_dir
{code}
Set a quota. This creates an implicit role.
{code}
$ cat quota.json
{
  "role": "role2",
  "force": true,
  "guarantee": [
    {
      "name": "cpus",
      "type": "SCALAR",
      "scalar": { "value": 1 }
    }
  ]
}
$ cat quota.json | http POST :5050/quota
HTTP/1.1 200 OK
Content-Length: 0
Date: Thu, 26 Jan 2017 12:33:38 GMT
$ http GET :5050/quota
HTTP/1.1 200 OK
Content-Length: 108
Content-Type: application/json
Date: Thu, 26 Jan 2017 12:33:56 GMT
{
  "infos": [
    {
      "guarantee": [
        {
          "name": "cpus",
          "role": "*",
          "scalar": {
            "value": 1.0
          },
          "type": "SCALAR"
        }
      ],
      "role": "role2"
    }
  ]
}
$ http GET :5050/roles
HTTP/1.1 200 OK
Content-Length: 106
Content-Type: application/json
Date: Thu, 26 Jan 2017 12:34:10 GMT
{
  "roles": [
    {
      "frameworks": [],
      "name": "role2",
      "resources": {
        "cpus": 0,
        "disk": 0,
        "gpus": 0,
        "mem": 0
      },
      "weight": 1.0
    }
  ]
}
{code}
Restart the master process with the same {{work_dir}} and, once it is up,
attempt to delete the quota. The {{DELETE}} succeeds with {{200 OK}}.
{code}
$ http DELETE :5050/quota/role2
HTTP/1.1 200 OK
Content-Length: 0
Date: Thu, 26 Jan 2017 12:36:04 GMT
{code}
After handling the request, the master hits a {{CHECK}} failure and aborts.
{code}
$ mesos-master --work_dir=work_dir
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0126 13:34:57.528599 3145483200 main.cpp:278] Build: 2017-01-23 07:57:34 by
bbannier
I0126 13:34:57.529131 3145483200 main.cpp:279] Version: 1.2.0
I0126 13:34:57.529139 3145483200 main.cpp:286] Git SHA:
dd07d025d40975ec660ed17031d95ec0dba842d2
[warn] kq_init: detected broken kqueue; not using.: No such process
I0126 13:34:57.758896 3145483200 main.cpp:385] Using 'HierarchicalDRF' allocator
I0126 13:34:57.764276 3145483200 replica.cpp:778] Replica recovered with log
positions 3 -> 4 with 0 holes and 0 unlearned
I0126 13:34:57.765278 256114688 recover.cpp:451] Starting replica recovery
I0126 13:34:57.765547 256114688 recover.cpp:477] Replica is in VOTING status
I0126 13:34:57.795964 257187840 master.cpp:383] Master
569073cc-1195-45e9-b0d4-e2e1bf0d13d5 (172.18.9.56) started on 172.18.9.56:5050
I0126 13:34:57.796023 257187840 master.cpp:385] Flags at startup:
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins"
--allocation_interval="1secs" --allocator="HierarchicalDRF"
--authenticate_agents="false" --authenticate_frameworks="false"
--authenticate_http_frameworks="false" --authenticate_http_readonly="false"
--authenticate_http_readwrite="false" --authenticators="crammd5"
--authorizers="local" --framework_sorter="drf" --help="false"
--hostname_lookup="true" --http_authenticators="basic"
--initialize_driver_logging="true" --log_auto_initialize="true"
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5"
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000"
--max_unreachable_tasks_per_framework="1000" --quiet="false"
--recovery_agent_removal_limit="100%" --registry="replicated_log"
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins"
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400"
--registry_store_timeout="20secs" --registry_strict="false"
--root_submissions="true" --user_sorter="drf" --version="false"
--webui_dir="/usr/local/share/mesos/webui" --work_dir="work_dir"
--zk_session_timeout="10secs"
I0126 13:34:57.796478 257187840 master.cpp:437] Master allowing unauthenticated
frameworks to register
I0126 13:34:57.796507 257187840 master.cpp:451] Master allowing unauthenticated
agents to register
I0126 13:34:57.796517 257187840 master.cpp:465] Master allowing HTTP frameworks
to register without authentication
I0126 13:34:57.796540 257187840 master.cpp:507] Using default 'crammd5'
authenticator
W0126 13:34:57.796573 257187840 authenticator.cpp:512] No credentials provided,
authentication requests will be refused
I0126 13:34:57.796584 257187840 authenticator.cpp:519] Initializing server SASL
I0126 13:34:57.825337 255578112 master.cpp:2121] Elected as the leading master!
I0126 13:34:57.825362 255578112 master.cpp:1643] Recovering from registrar
I0126 13:34:57.825736 255578112 log.cpp:553] Attempting to start the writer
I0126 13:34:57.826889 258260992 replica.cpp:495] Replica received implicit
promise request from __req_res__(1)@172.18.9.56:5050 with proposal 2
I0126 13:34:57.828855 258260992 replica.cpp:344] Persisted promised to 2
I0126 13:34:57.829273 258260992 coordinator.cpp:238] Coordinator attempting to
fill missing positions
I0126 13:34:57.829375 259334144 log.cpp:569] Writer started with ending
position 4
I0126 13:34:57.830878 257187840 registrar.cpp:362] Successfully fetched the
registry (159B) in 5.427968ms
I0126 13:34:57.831029 257187840 registrar.cpp:461] Applied 1 operations in
24us; attempting to update the registry
I0126 13:34:57.836194 259334144 coordinator.cpp:348] Coordinator attempting to
write APPEND action at position 5
I0126 13:34:57.836676 257724416 replica.cpp:539] Replica received write request
for position 5 from __req_res__(2)@172.18.9.56:5050
I0126 13:34:57.837102 255578112 replica.cpp:693] Replica received learned
notice for position 5 from @0.0.0.0:0
I0126 13:34:57.837745 257187840 registrar.cpp:506] Successfully updated the
registry in 6.685184ms
I0126 13:34:57.837806 257187840 registrar.cpp:392] Successfully recovered
registrar
I0126 13:34:57.837924 255578112 coordinator.cpp:348] Coordinator attempting to
write TRUNCATE action at position 6
I0126 13:34:57.838132 256651264 master.cpp:1759] Recovered 0 agents from the
registry (159B); allowing 10mins for agents to re-register
I0126 13:34:57.838312 257187840 replica.cpp:539] Replica received write request
for position 6 from __req_res__(3)@172.18.9.56:5050
I0126 13:34:57.838692 256651264 replica.cpp:693] Replica received learned
notice for position 6 from @0.0.0.0:0
I0126 13:36:04.887257 256114688 http.cpp:420] HTTP DELETE for
/master/quota/role2 from 127.0.0.1:51458 with User-Agent='HTTPie/0.9.8'
I0126 13:36:04.887512 255578112 registrar.cpp:461] Applied 1 operations in
42us; attempting to update the registry
I0126 13:36:04.892643 255578112 coordinator.cpp:348] Coordinator attempting to
write APPEND action at position 7
I0126 13:36:04.893127 258797568 replica.cpp:539] Replica received write request
for position 7 from __req_res__(4)@172.18.9.56:5050
I0126 13:36:04.895309 257187840 replica.cpp:693] Replica received learned
notice for position 7 from @0.0.0.0:0
I0126 13:36:04.895814 258260992 registrar.cpp:506] Successfully updated the
registry in 8.2688ms
F0126 13:36:04.895956 256114688 hierarchical.cpp:1180] Check failed:
quotas.contains(role)
*** Check failure stack trace: ***
I0126 13:36:04.895961 255578112 coordinator.cpp:348] Coordinator attempting to
write TRUNCATE action at position 8
I0126 13:36:04.896437 257187840 replica.cpp:539] Replica received write request
for position 8 from __req_res__(5)@172.18.9.56:5050
I0126 13:36:04.896908 259334144 replica.cpp:693] Replica received learned
notice for position 8 from @0.0.0.0:0
@ 0x10b5e52aa google::LogMessage::Fail()
E0126 13:36:04.905042 259870720 process.cpp:2419] Failed to shutdown socket
with fd 11: Socket is not connected
@ 0x10b5e282c google::LogMessage::SendToLog()
@ 0x10b5e3959 google::LogMessage::Flush()
@ 0x10b5ee159 google::LogMessageFatal::~LogMessageFatal()
@ 0x10b5e5795 google::LogMessageFatal::~LogMessageFatal()
@ 0x1089e8d17
mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeQuota()
@ 0x107ebbc13
_ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNSt3__112basic_stringIcNS6_11char_traitsIcEENS6_9allocatorIcEEEESC_EEvRKNS_3PIDIT_EEMSG_FvT0_ET1_ENKUlPNS_11ProcessBaseEE_clESP_
@ 0x107ebbab0
_ZNSt3__128__invoke_void_return_wrapperIvE6__callIJRZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEESF_EEvRKNS3_3PIDIT_EEMSJ_FvT0_ET1_EUlPNS3_11ProcessBaseEE_SS_EEEvDpOT_
@ 0x107ebb7b9
_ZNSt3__110__function6__funcIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEESE_EEvRKNS2_3PIDIT_EEMSI_FvT0_ET1_EUlPNS2_11ProcessBaseEE_NSC_ISS_EEFvSR_EEclEOSR_
@ 0x10b38ba27 std::__1::function<>::operator()()
@ 0x10b38b96c process::ProcessBase::visit()
@ 0x10b40415e process::DispatchEvent::visit()
@ 0x107665171 process::ProcessBase::serve()
@ 0x10b385c07 process::ProcessManager::resume()
@ 0x10b47db90
process::ProcessManager::init_threads()::$_0::operator()()
@ 0x10b47d7e0
_ZNSt3__114__thread_proxyINS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_EEEEZN7process14ProcessManager12init_threadsEvE3$_0EEEEEPvSB_
@ 0x7fffb2b8eaab _pthread_body
@ 0x7fffb2b8e9f7 _pthread_start
@ 0x7fffb2b8e1fd thread_start
[2] 59343 abort mesos-master --work_dir=work_dir
{code}
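The stack trace points at {{HierarchicalAllocatorProcess::removeQuota()}}, which
asserts that any role whose quota is being removed is tracked in the allocator's
in-memory {{quotas}} map. The symptom is consistent with recovery restoring the
quota in the registry but not re-populating that map. A minimal sketch of the
mismatched state (hypothetical simplification using plain STL containers, not
the Mesos source):
{code}
#include <iostream>
#include <set>
#include <string>
#include <unordered_map>

int main() {
  // Registry state persisted in work_dir: the quota survives the restart.
  std::set<std::string> registryQuotaRoles = {"role2"};

  // Allocator state after recovery: the quota was never re-added for the
  // role, so the in-memory map consulted by the CHECK is empty.
  std::unordered_map<std::string, double> allocatorQuotas;

  const std::string role = "role2";
  std::cout << std::boolalpha
            << "registry has quota: " << (registryQuotaRoles.count(role) > 0)
            << "\n"
            << "allocator tracks quota: " << (allocatorQuotas.count(role) > 0)
            << "\n";

  // DELETE /quota/role2 reaches removeQuota(), whose
  // CHECK(quotas.contains(role)) fails on exactly this mismatch.
  if (registryQuotaRoles.count(role) > 0 && allocatorQuotas.count(role) == 0) {
    std::cout << "Check failed: quotas.contains(role)\n";
  }
  return 0;
}
{code}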
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)