[
https://issues.apache.org/jira/browse/MESOS-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15756044#comment-15756044
]
Greg Mann commented on MESOS-6336:
----------------------------------
Hi [~a10gupta], I'm very sorry for the long delay in my reply!
I just took a look at your patch; thanks for submitting it! While the patch
does prevent the segfault, I would prefer to get to the bottom of the real
issue: why are we attempting to remove a framework which is not in the
{{frameworks}} map? I added some additional logging output, and it looks like
we're definitely executing {{Slave::removeFramework}} and erasing the
FrameworkID from the map before {{Slave::finalize}} gets executed. It's not
immediately clear to me why {{frameworks.keys()}} would return the FrameworkID
but accessing the value of that key would then segfault.
As [~vinodkone] suggested, it appears to me that the framework has been
completely removed from the map before {{Slave::finalize}} is called.
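To make the failure mode concrete, here is a minimal standalone sketch of one
mechanism that would produce exactly this symptom. Note that this is not the
Mesos source: {{std::unordered_map}} stands in for stout's {{hashmap}}, and
the {{Framework}} layout is invented so that the fault address mimics the
{{SIGSEGV (@0x88)}} in the trace below.
{code}
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for mesos::internal::slave::Framework; only the shape matters.
struct Framework
{
  char padding[0x88];  // Invented so the member below sits at offset 0x88.
  std::string id;      // Accessing this through a nullptr faults near 0x88.
};

int main()
{
  std::unordered_map<std::string, Framework*> frameworks;
  frameworks["fw-0000"] = new Framework();

  // Snapshot the keys, as finalize() does with frameworks.keys().
  std::vector<std::string> keys;
  for (const auto& entry : frameworks) {
    keys.push_back(entry.first);
  }

  // Simulate Slave::removeFramework() erasing the entry before the
  // loop below dereferences it.
  frameworks.erase("fw-0000");

  for (const std::string& key : keys) {
    // operator[] on a missing key default-constructs a nullptr value
    // instead of failing loudly...
    Framework* framework = frameworks[key];

    // ...so this member access segfaults at a small address (~0x88).
    std::cout << framework->id << std::endl;
  }

  return 0;
}
{code}
If this is the mechanism, the key snapshot is stale by the time the loop body
runs, and the lookup silently materializes a null value rather than failing
loudly, which would explain a segfault inside {{Slave::finalize}} even though
the framework was already removed.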
I notice that in {{shutdownFramework}}, we log a warning and return early if
{{framework == nullptr}}, so perhaps we do expect to attempt this operation on
nonexistent frameworks sometimes. Nonetheless, let's do some investigation and
nail down the root cause of the segfault. Would you like to get on a hangout
early next week and troubleshoot?
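For reference, the guard I'm referring to looks roughly like this; I'm
paraphrasing from memory rather than quoting the source, so treat the exact
signature as approximate.
{code}
// Paraphrase of the early return in Slave::shutdownFramework(); not a
// verbatim excerpt.
void Slave::shutdownFramework(
    const process::UPID& from,
    const FrameworkID& frameworkId)
{
  Framework* framework = getFramework(frameworkId);
  if (framework == nullptr) {
    LOG(WARNING) << "Cannot shut down unknown framework " << frameworkId;
    return;  // Tolerate shutdown requests for nonexistent frameworks.
  }

  // ... shut down the framework's executors, remove the framework, etc. ...
}
{code}
If {{finalize}} is expected to hit this path with already-removed frameworks,
then the same kind of membership check presumably belongs wherever the map
value is dereferenced during {{finalize}}; figuring out which of those two
invariants we actually intend to hold is what I'd like to nail down.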
> SlaveTest.KillTaskGroupBetweenRunTaskParts is flaky
> ---------------------------------------------------
>
> Key: MESOS-6336
> URL: https://issues.apache.org/jira/browse/MESOS-6336
> Project: Mesos
> Issue Type: Bug
> Components: agent
> Reporter: Greg Mann
> Assignee: Abhishek Dasgupta
> Labels: mesosphere
>
> The test {{SlaveTest.KillTaskGroupBetweenRunTaskParts}} sometimes segfaults
> during the agent's {{finalize()}} method. This was observed on our internal
> CI, on Fedora with libev, without SSL:
> {code}
> [ RUN ] SlaveTest.KillTaskGroupBetweenRunTaskParts
> I1007 14:12:57.973811 28630 cluster.cpp:158] Creating default 'local'
> authorizer
> I1007 14:12:57.982128 28630 leveldb.cpp:174] Opened db in 8.195028ms
> I1007 14:12:57.982599 28630 leveldb.cpp:181] Compacted db in 446238ns
> I1007 14:12:57.982616 28630 leveldb.cpp:196] Created db iterator in 3650ns
> I1007 14:12:57.982622 28630 leveldb.cpp:202] Seeked to beginning of db in
> 451ns
> I1007 14:12:57.982627 28630 leveldb.cpp:271] Iterated through 0 keys in the
> db in 352ns
> I1007 14:12:57.982638 28630 replica.cpp:776] Replica recovered with log
> positions 0 -> 0 with 1 holes and 0 unlearned
> I1007 14:12:57.983024 28645 recover.cpp:451] Starting replica recovery
> I1007 14:12:57.983127 28651 recover.cpp:477] Replica is in EMPTY status
> I1007 14:12:57.983459 28644 replica.cpp:673] Replica in EMPTY status received
> a broadcasted recover request from __req_res__(6234)@172.30.2.161:38776
> I1007 14:12:57.983543 28651 recover.cpp:197] Received a recover response from
> a replica in EMPTY status
> I1007 14:12:57.983680 28650 recover.cpp:568] Updating replica status to
> STARTING
> I1007 14:12:57.983990 28648 master.cpp:380] Master
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb
> (ip-172-30-2-161.ec2.internal.mesosphere.io) started on 172.30.2.161:38776
> I1007 14:12:57.984007 28648 master.cpp:382] Flags at startup: --acls=""
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins"
> --allocation_interval="1secs" --allocator="HierarchicalDRF"
> --authenticate_agents="true" --authenticate_frameworks="true"
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true"
> --authenticate_http_readwrite="true" --authenticators="crammd5"
> --authorizers="local" --credentials="/tmp/rVbcaO/credentials"
> --framework_sorter="drf" --help="false" --hostname_lookup="true"
> --http_authenticators="basic" --http_framework_authenticators="basic"
> --initialize_driver_logging="true" --log_auto_initialize="true"
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5"
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000"
> --quiet="false" --recovery_agent_removal_limit="100%"
> --registry="replicated_log" --registry_fetch_timeout="1mins"
> --registry_gc_interval="15mins" --registry_max_agent_age="2weeks"
> --registry_max_agent_count="102400" --registry_store_timeout="100secs"
> --registry_strict="false" --root_submissions="true" --user_sorter="drf"
> --version="false" --webui_dir="/usr/local/share/mesos/webui"
> --work_dir="/tmp/rVbcaO/master" --zk_session_timeout="10secs"
> I1007 14:12:57.984127 28648 master.cpp:432] Master only allowing
> authenticated frameworks to register
> I1007 14:12:57.984134 28648 master.cpp:446] Master only allowing
> authenticated agents to register
> I1007 14:12:57.984139 28648 master.cpp:459] Master only allowing
> authenticated HTTP frameworks to register
> I1007 14:12:57.984143 28648 credentials.hpp:37] Loading credentials for
> authentication from '/tmp/rVbcaO/credentials'
> I1007 14:12:57.988487 28648 master.cpp:504] Using default 'crammd5'
> authenticator
> I1007 14:12:57.988530 28648 http.cpp:883] Using default 'basic' HTTP
> authenticator for realm 'mesos-master-readonly'
> I1007 14:12:57.988585 28648 http.cpp:883] Using default 'basic' HTTP
> authenticator for realm 'mesos-master-readwrite'
> I1007 14:12:57.988648 28648 http.cpp:883] Using default 'basic' HTTP
> authenticator for realm 'mesos-master-scheduler'
> I1007 14:12:57.988680 28648 master.cpp:584] Authorization enabled
> I1007 14:12:57.988757 28650 whitelist_watcher.cpp:77] No whitelist given
> I1007 14:12:57.988772 28646 hierarchical.cpp:149] Initialized hierarchical
> allocator process
> I1007 14:12:57.988917 28651 leveldb.cpp:304] Persisting metadata (8 bytes) to
> leveldb took 5.186917ms
> I1007 14:12:57.988934 28651 replica.cpp:320] Persisted replica status to
> STARTING
> I1007 14:12:57.989045 28651 recover.cpp:477] Replica is in STARTING status
> I1007 14:12:57.989316 28648 master.cpp:2013] Elected as the leading master!
> I1007 14:12:57.989331 28648 master.cpp:1560] Recovering from registrar
> I1007 14:12:57.989408 28649 replica.cpp:673] Replica in STARTING status
> received a broadcasted recover request from
> __req_res__(6235)@172.30.2.161:38776
> I1007 14:12:57.989423 28648 registrar.cpp:329] Recovering registrar
> I1007 14:12:57.989792 28647 recover.cpp:197] Received a recover response from
> a replica in STARTING status
> I1007 14:12:57.989956 28650 recover.cpp:568] Updating replica status to VOTING
> I1007 14:12:57.990365 28651 leveldb.cpp:304] Persisting metadata (8 bytes) to
> leveldb took 326153ns
> I1007 14:12:57.990378 28651 replica.cpp:320] Persisted replica status to
> VOTING
> I1007 14:12:57.990412 28651 recover.cpp:582] Successfully joined the Paxos
> group
> I1007 14:12:57.990470 28651 recover.cpp:466] Recover process terminated
> I1007 14:12:57.990602 28647 log.cpp:553] Attempting to start the writer
> I1007 14:12:57.990994 28647 replica.cpp:493] Replica received implicit
> promise request from __req_res__(6236)@172.30.2.161:38776 with proposal 1
> I1007 14:12:57.992638 28647 leveldb.cpp:304] Persisting metadata (8 bytes) to
> leveldb took 1.62248ms
> I1007 14:12:57.992655 28647 replica.cpp:342] Persisted promised to 1
> I1007 14:12:57.992843 28646 coordinator.cpp:238] Coordinator attempting to
> fill missing positions
> I1007 14:12:57.993249 28650 replica.cpp:388] Replica received explicit
> promise request from __req_res__(6237)@172.30.2.161:38776 for position 0 with
> proposal 2
> I1007 14:12:57.995498 28650 leveldb.cpp:341] Persisting action (8 bytes) to
> leveldb took 2.225608ms
> I1007 14:12:57.995515 28650 replica.cpp:708] Persisted action NOP at position 0
> I1007 14:12:57.996105 28649 replica.cpp:537] Replica received write request
> for position 0 from __req_res__(6238)@172.30.2.161:38776
> I1007 14:12:57.996130 28649 leveldb.cpp:436] Reading position from leveldb
> took 8903ns
> I1007 14:12:57.997609 28649 leveldb.cpp:341] Persisting action (14 bytes) to
> leveldb took 1.465755ms
> I1007 14:12:57.997627 28649 replica.cpp:708] Persisted action NOP at position 0
> I1007 14:12:57.997859 28646 replica.cpp:691] Replica received learned notice
> for position 0 from @0.0.0.0:0
> I1007 14:12:57.999869 28646 leveldb.cpp:341] Persisting action (16 bytes) to
> leveldb took 1.985345ms
> I1007 14:12:57.999886 28646 replica.cpp:708] Persisted action NOP at position 0
> I1007 14:12:58.000043 28644 log.cpp:569] Writer started with ending position 0
> I1007 14:12:58.000308 28650 leveldb.cpp:436] Reading position from leveldb
> took 8280ns
> I1007 14:12:58.000512 28650 registrar.cpp:362] Successfully fetched the
> registry (0B) in 11.06816ms
> I1007 14:12:58.000562 28650 registrar.cpp:461] Applied 1 operations in
> 3050ns; attempting to update the registry
> I1007 14:12:58.000722 28646 log.cpp:577] Attempting to append 235 bytes to
> the log
> I1007 14:12:58.000799 28651 coordinator.cpp:348] Coordinator attempting to
> write APPEND action at position 1
> I1007 14:12:58.001106 28648 replica.cpp:537] Replica received write request
> for position 1 from __req_res__(6239)@172.30.2.161:38776
> I1007 14:12:58.002013 28648 leveldb.cpp:341] Persisting action (254 bytes) to
> leveldb took 884026ns
> I1007 14:12:58.002027 28648 replica.cpp:708] Persisted action APPEND at
> position 1
> I1007 14:12:58.002336 28650 replica.cpp:691] Replica received learned notice
> for position 1 from @0.0.0.0:0
> I1007 14:12:58.004251 28650 leveldb.cpp:341] Persisting action (256 bytes) to
> leveldb took 1.897221ms
> I1007 14:12:58.004268 28650 replica.cpp:708] Persisted action APPEND at
> position 1
> I1007 14:12:58.004483 28645 registrar.cpp:506] Successfully updated the
> registry in 3.902208ms
> I1007 14:12:58.004537 28646 log.cpp:596] Attempting to truncate the log to 1
> I1007 14:12:58.004566 28644 coordinator.cpp:348] Coordinator attempting to
> write TRUNCATE action at position 2
> I1007 14:12:58.004580 28645 registrar.cpp:392] Successfully recovered
> registrar
> I1007 14:12:58.004871 28650 master.cpp:1676] Recovered 0 agents from the
> registry (196B); allowing 10mins for agents to re-register
> I1007 14:12:58.004896 28648 replica.cpp:537] Replica received write request
> for position 2 from __req_res__(6240)@172.30.2.161:38776
> I1007 14:12:58.004914 28649 hierarchical.cpp:176] Skipping recovery of
> hierarchical allocator: nothing to recover
> I1007 14:12:58.006487 28648 leveldb.cpp:341] Persisting action (16 bytes) to
> leveldb took 1.566923ms
> I1007 14:12:58.006503 28648 replica.cpp:708] Persisted action TRUNCATE at
> position 2
> I1007 14:12:58.006700 28651 replica.cpp:691] Replica received learned notice
> for position 2 from @0.0.0.0:0
> I1007 14:12:58.008687 28651 leveldb.cpp:341] Persisting action (18 bytes) to
> leveldb took 1.966404ms
> I1007 14:12:58.008713 28651 leveldb.cpp:399] Deleting ~1 keys from leveldb
> took 10225ns
> I1007 14:12:58.008721 28651 replica.cpp:708] Persisted action TRUNCATE at
> position 2
> I1007 14:12:58.010568 28645 slave.cpp:208] Mesos agent started on
> @172.30.2.161:38776
> I1007 14:12:58.010679 28630 scheduler.cpp:176] Version: 1.1.0
> I1007 14:12:58.010578 28645 slave.cpp:209] Flags at startup: --acls=""
> --appc_simple_discovery_uri_prefix="http://"
> --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="true"
> --authenticate_http_readwrite="true" --authenticatee="crammd5"
> --authentication_backoff_factor="1secs" --authorizer="local"
> --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false"
> --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false"
> --cgroups_root="mesos" --container_disk_watch_interval="15secs"
> --containerizers="mesos"
> --credential="/mnt/teamcity/temp/buildTmp/SlaveTest_KillTaskGroupBetweenRunTaskParts_dv7FC5/credential"
> --default_role="*" --disk_watch_interval="1mins" --docker="docker"
> --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io"
> --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock"
> --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker"
> --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume"
> --enforce_container_disk_quota="false"
> --executor_registration_timeout="1mins"
> --executor_shutdown_grace_period="5secs"
> --fetcher_cache_dir="/mnt/teamcity/temp/buildTmp/SlaveTest_KillTaskGroupBetweenRunTaskParts_dv7FC5/fetch"
> --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks"
> --gc_disk_headroom="0.1" --hadoop_home="" --help="false"
> --hostname_lookup="true" --http_authenticators="basic"
> --http_command_executor="false"
> --http_credentials="/mnt/teamcity/temp/buildTmp/SlaveTest_KillTaskGroupBetweenRunTaskParts_dv7FC5/http_credentials"
> --image_provisioner_backend="copy" --initialize_driver_logging="true"
> --isolation="posix/cpu,posix/mem" --launcher="linux"
> --launcher_dir="/mnt/teamcity/work/4240ba9ddd0997c3/build/src"
> --logbufsecs="0" --logging_level="INFO"
> --oversubscribed_resources_interval="15secs" --perf_duration="10secs"
> --perf_interval="1mins" --qos_correction_interval_min="0ns" --quiet="false"
> --recover="reconnect" --recovery_timeout="15mins"
> --registration_backoff_factor="10ms"
> --resources="cpus:2;gpus:0;mem:1024;disk:1024;ports:[31000-32000]"
> --revocable_cpu_low_priority="true"
> --runtime_dir="/mnt/teamcity/temp/buildTmp/SlaveTest_KillTaskGroupBetweenRunTaskParts_dv7FC5"
> --sandbox_directory="/mnt/mesos/sandbox" --strict="true"
> --switch_user="true" --systemd_enable_support="true"
> --systemd_runtime_directory="/run/systemd/system" --version="false"
> --work_dir="/mnt/teamcity/temp/buildTmp/SlaveTest_KillTaskGroupBetweenRunTaskParts_UtMqTW"
> --xfs_project_range="[5000-10000]"
> I1007 14:12:58.010781 28645 credentials.hpp:86] Loading credential for
> authentication from
> '/mnt/teamcity/temp/buildTmp/SlaveTest_KillTaskGroupBetweenRunTaskParts_dv7FC5/credential'
> I1007 14:12:58.010834 28645 slave.cpp:346] Agent using credential for:
> test-principal
> I1007 14:12:58.010850 28645 credentials.hpp:37] Loading credentials for
> authentication from
> '/mnt/teamcity/temp/buildTmp/SlaveTest_KillTaskGroupBetweenRunTaskParts_dv7FC5/http_credentials'
> I1007 14:12:58.010880 28644 scheduler.cpp:465] New master detected at
> master@172.30.2.161:38776
> I1007 14:12:58.010898 28644 scheduler.cpp:474] Waiting for 0ns before
> initiating a re-(connection) attempt with the master
> I1007 14:12:58.010920 28645 http.cpp:883] Using default 'basic' HTTP
> authenticator for realm 'mesos-agent-readonly'
> I1007 14:12:58.010948 28645 http.cpp:883] Using default 'basic' HTTP
> authenticator for realm 'mesos-agent-readwrite'
> I1007 14:12:58.011189 28645 slave.cpp:533] Agent resources: cpus(*):2;
> mem(*):1024; disk(*):1024; ports(*):[31000-32000]
> I1007 14:12:58.011216 28645 slave.cpp:541] Agent attributes: [ ]
> I1007 14:12:58.011221 28645 slave.cpp:546] Agent hostname:
> ip-172-30-2-161.ec2.internal.mesosphere.io
> I1007 14:12:58.011471 28651 state.cpp:57] Recovering state from
> '/mnt/teamcity/temp/buildTmp/SlaveTest_KillTaskGroupBetweenRunTaskParts_UtMqTW/meta'
> I1007 14:12:58.011592 28644 status_update_manager.cpp:203] Recovering status
> update manager
> I1007 14:12:58.011668 28644 slave.cpp:5256] Finished recovery
> I1007 14:12:58.011844 28644 slave.cpp:5428] Querying resource estimator for
> oversubscribable resources
> I1007 14:12:58.011942 28644 slave.cpp:915] New master detected at
> master@172.30.2.161:38776
> I1007 14:12:58.011958 28651 status_update_manager.cpp:177] Pausing sending
> status updates
> I1007 14:12:58.011960 28644 slave.cpp:974] Authenticating with master
> master@172.30.2.161:38776
> I1007 14:12:58.011991 28644 slave.cpp:985] Using default CRAM-MD5
> authenticatee
> I1007 14:12:58.012025 28644 slave.cpp:947] Detecting new master
> I1007 14:12:58.012051 28649 authenticatee.cpp:121] Creating new client SASL
> connection
> I1007 14:12:58.012137 28651 master.cpp:6634] Authenticating
> (401)@172.30.2.161:38776
> I1007 14:12:58.012226 28647 authenticator.cpp:414] Starting authentication
> session for crammd5-authenticatee(991)@172.30.2.161:38776
> I1007 14:12:58.012310 28651 authenticator.cpp:98] Creating new server SASL
> connection
> I1007 14:12:58.012409 28651 authenticatee.cpp:213] Received SASL
> authentication mechanisms: CRAM-MD5
> I1007 14:12:58.012428 28651 authenticatee.cpp:239] Attempting to authenticate
> with mechanism 'CRAM-MD5'
> I1007 14:12:58.012471 28651 authenticator.cpp:204] Received SASL
> authentication start
> I1007 14:12:58.012507 28651 authenticator.cpp:326] Authentication requires
> more steps
> I1007 14:12:58.012538 28648 scheduler.cpp:353] Connected with the master at
> http://172.30.2.161:38776/master/api/v1/scheduler
> I1007 14:12:58.012552 28647 authenticatee.cpp:259] Received SASL
> authentication step
> I1007 14:12:58.012673 28651 authenticator.cpp:232] Received SASL
> authentication step
> I1007 14:12:58.012693 28651 auxprop.cpp:109] Request to lookup properties for
> user: 'test-principal' realm: 'ip-172-30-2-161.ec2.internal' server FQDN:
> 'ip-172-30-2-161.ec2.internal' SASL_AUXPROP_VERIFY_AGAINST_HASH: false
> SASL_AUXPROP_OVERRIDE: false SASL_AUXPROP_AUTHZID: false
> I1007 14:12:58.012701 28651 auxprop.cpp:181] Looking up auxiliary property
> '*userPassword'
> I1007 14:12:58.012712 28651 auxprop.cpp:181] Looking up auxiliary property
> '*cmusaslsecretCRAM-MD5'
> I1007 14:12:58.012722 28651 auxprop.cpp:109] Request to lookup properties for
> user: 'test-principal' realm: 'ip-172-30-2-161.ec2.internal' server FQDN:
> 'ip-172-30-2-161.ec2.internal' SASL_AUXPROP_VERIFY_AGAINST_HASH: false
> SASL_AUXPROP_OVERRIDE: false SASL_AUXPROP_AUTHZID: true
> I1007 14:12:58.012729 28651 auxprop.cpp:131] Skipping auxiliary property
> '*userPassword' since SASL_AUXPROP_AUTHZID == true
> I1007 14:12:58.012737 28651 auxprop.cpp:131] Skipping auxiliary property
> '*cmusaslsecretCRAM-MD5' since SASL_AUXPROP_AUTHZID == true
> I1007 14:12:58.012749 28651 authenticator.cpp:318] Authentication success
> I1007 14:12:58.012830 28648 authenticatee.cpp:299] Authentication success
> I1007 14:12:58.012850 28645 master.cpp:6664] Successfully authenticated
> principal 'test-principal' at (401)@172.30.2.161:38776
> I1007 14:12:58.012876 28646 authenticator.cpp:432] Authentication session
> cleanup for crammd5-authenticatee(991)@172.30.2.161:38776
> I1007 14:12:58.012972 28648 slave.cpp:1069] Successfully authenticated with
> master master@172.30.2.161:38776
> I1007 14:12:58.013017 28651 scheduler.cpp:235] Sending SUBSCRIBE call to
> http://172.30.2.161:38776/master/api/v1/scheduler
> I1007 14:12:58.013093 28648 slave.cpp:1475] Will retry registration in
> 4.168895ms if necessary
> I1007 14:12:58.013185 28649 master.cpp:5074] Registering agent at
> (401)@172.30.2.161:38776 (ip-172-30-2-161.ec2.internal.mesosphere.io) with id
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-S0
> I1007 14:12:58.013326 28649 process.cpp:3377] Handling HTTP event for process
> 'master' with path: '/master/api/v1/scheduler'
> I1007 14:12:58.013386 28644 registrar.cpp:461] Applied 1 operations in
> 14823ns; attempting to update the registry
> I1007 14:12:58.013556 28651 log.cpp:577] Attempting to append 434 bytes to
> the log
> I1007 14:12:58.013615 28651 coordinator.cpp:348] Coordinator attempting to
> write APPEND action at position 3
> I1007 14:12:58.013741 28650 http.cpp:382] HTTP POST for
> /master/api/v1/scheduler from 172.30.2.161:48518
> I1007 14:12:58.013804 28650 master.cpp:2309] Received subscription request
> for HTTP framework 'default'
> I1007 14:12:58.013823 28650 master.cpp:2049] Authorizing framework principal
> 'test-principal' to receive offers for role '*'
> I1007 14:12:58.013964 28648 master.cpp:2407] Subscribing framework 'default'
> with checkpointing disabled and capabilities [ ]
> I1007 14:12:58.013985 28649 replica.cpp:537] Replica received write request
> for position 3 from __req_res__(6241)@172.30.2.161:38776
> I1007 14:12:58.014142 28648 master.hpp:2162] Sending heartbeat to
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000
> I1007 14:12:58.014166 28644 hierarchical.cpp:275] Added framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000
> I1007 14:12:58.014189 28644 hierarchical.cpp:1694] No allocations performed
> I1007 14:12:58.014200 28644 hierarchical.cpp:1789] No inverse offers to send
> out!
> I1007 14:12:58.014214 28644 hierarchical.cpp:1286] Performed allocation for 0
> agents in 31808ns
> I1007 14:12:58.014370 28645 scheduler.cpp:671] Enqueuing event SUBSCRIBED
> received from http://172.30.2.161:38776/master/api/v1/scheduler
> I1007 14:12:58.014456 28649 leveldb.cpp:341] Persisting action (453 bytes) to
> leveldb took 448489ns
> I1007 14:12:58.014470 28649 replica.cpp:708] Persisted action APPEND at
> position 3
> I1007 14:12:58.014523 28648 scheduler.cpp:671] Enqueuing event HEARTBEAT
> received from http://172.30.2.161:38776/master/api/v1/scheduler
> I1007 14:12:58.014883 28645 replica.cpp:691] Replica received learned notice
> for position 3 from @0.0.0.0:0
> I1007 14:12:58.016865 28645 leveldb.cpp:341] Persisting action (455 bytes) to
> leveldb took 1.963645ms
> I1007 14:12:58.016880 28645 replica.cpp:708] Persisted action APPEND at
> position 3
> I1007 14:12:58.017127 28644 registrar.cpp:506] Successfully updated the
> registry in 3.705088ms
> I1007 14:12:58.017184 28648 log.cpp:596] Attempting to truncate the log to 3
> I1007 14:12:58.017242 28651 coordinator.cpp:348] Coordinator attempting to
> write TRUNCATE action at position 4
> I1007 14:12:58.017367 28645 master.cpp:5145] Registered agent
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-S0 at (401)@172.30.2.161:38776
> (ip-172-30-2-161.ec2.internal.mesosphere.io) with cpus(*):2; mem(*):1024;
> disk(*):1024; ports(*):[31000-32000]
> I1007 14:12:58.017423 28644 slave.cpp:4108] Received ping from
> slave-observer(454)@172.30.2.161:38776
> I1007 14:12:58.017436 28650 hierarchical.cpp:485] Added agent
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-S0
> (ip-172-30-2-161.ec2.internal.mesosphere.io) with cpus(*):2; mem(*):1024;
> disk(*):1024; ports(*):[31000-32000] (allocated: {})
> I1007 14:12:58.017483 28644 slave.cpp:1115] Registered with master
> master@172.30.2.161:38776; given agent ID
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-S0
> I1007 14:12:58.017501 28644 fetcher.cpp:86] Clearing fetcher cache
> I1007 14:12:58.017575 28651 status_update_manager.cpp:184] Resuming sending
> status updates
> I1007 14:12:58.017680 28650 hierarchical.cpp:1789] No inverse offers to send
> out!
> I1007 14:12:58.017709 28650 hierarchical.cpp:1309] Performed allocation for
> agent 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-S0 in 246743ns
> I1007 14:12:58.017750 28644 slave.cpp:1138] Checkpointing SlaveInfo to
> '/mnt/teamcity/temp/buildTmp/SlaveTest_KillTaskGroupBetweenRunTaskParts_UtMqTW/meta/slaves/76d4d55f-dcc6-4033-85d9-7ec97ef353cb-S0/slave.info'
> I1007 14:12:58.017774 28645 master.cpp:6463] Sending 1 offers to framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000 (default)
> I1007 14:12:58.017982 28644 replica.cpp:537] Replica received write request
> for position 4 from __req_res__(6242)@172.30.2.161:38776
> I1007 14:12:58.018200 28645 scheduler.cpp:671] Enqueuing event OFFERS
> received from http://172.30.2.161:38776/master/api/v1/scheduler
> I1007 14:12:58.018880 28646 scheduler.cpp:235] Sending ACCEPT call to
> http://172.30.2.161:38776/master/api/v1/scheduler
> I1007 14:12:58.019086 28644 leveldb.cpp:341] Persisting action (16 bytes) to
> leveldb took 1.082078ms
> I1007 14:12:58.019104 28644 replica.cpp:708] Persisted action TRUNCATE at
> position 4
> I1007 14:12:58.019212 28645 process.cpp:3377] Handling HTTP event for process
> 'master' with path: '/master/api/v1/scheduler'
> I1007 14:12:58.019390 28645 replica.cpp:691] Replica received learned notice
> for position 4 from @0.0.0.0:0
> I1007 14:12:58.019577 28647 http.cpp:382] HTTP POST for
> /master/api/v1/scheduler from 172.30.2.161:48516
> I1007 14:12:58.019711 28647 master.cpp:3521] Processing ACCEPT call for
> offers: [ 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-O0 ] on agent
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-S0 at (401)@172.30.2.161:38776
> (ip-172-30-2-161.ec2.internal.mesosphere.io) for framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000 (default)
> I1007 14:12:58.019737 28647 master.cpp:3143] Authorizing framework principal
> 'test-principal' to launch task ed09b29b-d149-46c1-a31d-f8fe3fdf214e
> I1007 14:12:58.019785 28647 master.cpp:3143] Authorizing framework principal
> 'test-principal' to launch task 9ce218ba-84b7-4ea4-a560-b30472f4a22c
> I1007 14:12:58.020279 28647 master.cpp:8214] Adding task
> ed09b29b-d149-46c1-a31d-f8fe3fdf214e with resources cpus(*):0.1; mem(*):32;
> disk(*):32 on agent 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-S0
> (ip-172-30-2-161.ec2.internal.mesosphere.io)
> I1007 14:12:58.020342 28647 master.cpp:8214] Adding task
> 9ce218ba-84b7-4ea4-a560-b30472f4a22c with resources cpus(*):0.1; mem(*):32;
> disk(*):32 on agent 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-S0
> (ip-172-30-2-161.ec2.internal.mesosphere.io)
> I1007 14:12:58.020375 28647 master.cpp:4358] Launching task group {
> 9ce218ba-84b7-4ea4-a560-b30472f4a22c, ed09b29b-d149-46c1-a31d-f8fe3fdf214e }
> of framework 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000 (default) with
> resources cpus(*):0.2; mem(*):64; disk(*):64 on agent
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-S0 at (401)@172.30.2.161:38776
> (ip-172-30-2-161.ec2.internal.mesosphere.io)
> I1007 14:12:58.020480 28648 slave.cpp:1539] Got assigned task group
> containing tasks [ ed09b29b-d149-46c1-a31d-f8fe3fdf214e,
> 9ce218ba-84b7-4ea4-a560-b30472f4a22c ] for framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000
> I1007 14:12:58.020515 28650 hierarchical.cpp:1018] Recovered cpus(*):1.8;
> mem(*):960; disk(*):960; ports(*):[31000-32000] (total: cpus(*):2;
> mem(*):1024; disk(*):1024; ports(*):[31000-32000], allocated: cpus(*):0.2;
> mem(*):64; disk(*):64) on agent 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-S0 from
> framework 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000
> I1007 14:12:58.020541 28650 hierarchical.cpp:1055] Framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000 filtered agent
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-S0 for 5secs
> I1007 14:12:58.020830 28644 scheduler.cpp:235] Sending KILL call to
> http://172.30.2.161:38776/master/api/v1/scheduler
> I1007 14:12:58.021075 28651 process.cpp:3377] Handling HTTP event for process
> 'master' with path: '/master/api/v1/scheduler'
> I1007 14:12:58.021347 28648 http.cpp:382] HTTP POST for
> /master/api/v1/scheduler from 172.30.2.161:48516
> I1007 14:12:58.021399 28648 master.cpp:4682] Telling agent
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-S0 at (401)@172.30.2.161:38776
> (ip-172-30-2-161.ec2.internal.mesosphere.io) to kill task
> ed09b29b-d149-46c1-a31d-f8fe3fdf214e of framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000 (default)
> I1007 14:12:58.021473 28646 slave.cpp:2283] Asked to kill task
> ed09b29b-d149-46c1-a31d-f8fe3fdf214e of framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000
> W1007 14:12:58.021491 28646 slave.cpp:2324] Killing task
> ed09b29b-d149-46c1-a31d-f8fe3fdf214e of framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000 before it was launched
> I1007 14:12:58.021483 28645 leveldb.cpp:341] Persisting action (18 bytes) to
> leveldb took 2.0668ms
> I1007 14:12:58.021524 28645 leveldb.cpp:399] Deleting ~2 keys from leveldb
> took 17427ns
> I1007 14:12:58.021539 28645 replica.cpp:708] Persisted action TRUNCATE at
> position 4
> I1007 14:12:58.021720 28630 slave.cpp:1696] Launching task group containing
> tasks [ ed09b29b-d149-46c1-a31d-f8fe3fdf214e,
> 9ce218ba-84b7-4ea4-a560-b30472f4a22c ] for framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000
> W1007 14:12:58.021749 28630 slave.cpp:1736] Ignoring running task group
> containing tasks [ ed09b29b-d149-46c1-a31d-f8fe3fdf214e,
> 9ce218ba-84b7-4ea4-a560-b30472f4a22c ] of framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000 because it has been killed in the
> meantime
> I1007 14:12:58.021777 28630 slave.cpp:3609] Handling status update
> TASK_KILLED (UUID: b28bbec8-5746-4f3a-bf53-2e4f1a78b4cb) for task
> ed09b29b-d149-46c1-a31d-f8fe3fdf214e of framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000 from @0.0.0.0:0
> W1007 14:12:58.021800 28630 slave.cpp:3705] Could not find the executor for
> status update TASK_KILLED (UUID: b28bbec8-5746-4f3a-bf53-2e4f1a78b4cb) for
> task ed09b29b-d149-46c1-a31d-f8fe3fdf214e of framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000
> I1007 14:12:58.021888 28630 slave.cpp:3609] Handling status update
> TASK_KILLED (UUID: a768cdc6-4d09-467c-8a66-f0117744a8dd) for task
> 9ce218ba-84b7-4ea4-a560-b30472f4a22c of framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000 from @0.0.0.0:0
> I1007 14:12:58.021890 28647 status_update_manager.cpp:323] Received status
> update TASK_KILLED (UUID: b28bbec8-5746-4f3a-bf53-2e4f1a78b4cb) for task
> ed09b29b-d149-46c1-a31d-f8fe3fdf214e of framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000
> I1007 14:12:58.021919 28647 status_update_manager.cpp:500] Creating
> StatusUpdate stream for task ed09b29b-d149-46c1-a31d-f8fe3fdf214e of
> framework 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000
> W1007 14:12:58.021913 28630 slave.cpp:3705] Could not find the executor for
> status update TASK_KILLED (UUID: a768cdc6-4d09-467c-8a66-f0117744a8dd) for
> task 9ce218ba-84b7-4ea4-a560-b30472f4a22c of framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000
> I1007 14:12:58.021981 28630 slave.cpp:4709] Cleaning up framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000
> I1007 14:12:58.021986 28647 status_update_manager.cpp:377] Forwarding update
> TASK_KILLED (UUID: b28bbec8-5746-4f3a-bf53-2e4f1a78b4cb) for task
> ed09b29b-d149-46c1-a31d-f8fe3fdf214e of framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000 to the agent
> E1007 14:12:58.022033 28630 slave.cpp:5408] Failed to find the mtime of
> '/mnt/teamcity/temp/buildTmp/SlaveTest_KillTaskGroupBetweenRunTaskParts_UtMqTW/slaves/76d4d55f-dcc6-4033-85d9-7ec97ef353cb-S0/frameworks/76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000':
> Error invoking stat for
> '/mnt/teamcity/temp/buildTmp/SlaveTest_KillTaskGroupBetweenRunTaskParts_UtMqTW/slaves/76d4d55f-dcc6-4033-85d9-7ec97ef353cb-S0/frameworks/76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000':
> No such file or directory
> I1007 14:12:58.022058 28651 slave.cpp:4026] Forwarding the update TASK_KILLED
> (UUID: b28bbec8-5746-4f3a-bf53-2e4f1a78b4cb) for task
> ed09b29b-d149-46c1-a31d-f8fe3fdf214e of framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000 to master@172.30.2.161:38776
> I1007 14:12:58.022063 28647 status_update_manager.cpp:323] Received status
> update TASK_KILLED (UUID: a768cdc6-4d09-467c-8a66-f0117744a8dd) for task
> 9ce218ba-84b7-4ea4-a560-b30472f4a22c of framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000
> I1007 14:12:58.022099 28647 status_update_manager.cpp:500] Creating
> StatusUpdate stream for task 9ce218ba-84b7-4ea4-a560-b30472f4a22c of
> framework 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000
> I1007 14:12:58.022140 28651 slave.cpp:3920] Status update manager
> successfully handled status update TASK_KILLED (UUID:
> b28bbec8-5746-4f3a-bf53-2e4f1a78b4cb) for task
> ed09b29b-d149-46c1-a31d-f8fe3fdf214e of framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000
> I1007 14:12:58.022178 28647 status_update_manager.cpp:377] Forwarding update
> TASK_KILLED (UUID: a768cdc6-4d09-467c-8a66-f0117744a8dd) for task
> 9ce218ba-84b7-4ea4-a560-b30472f4a22c of framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000 to the agent
> I1007 14:12:58.022231 28646 master.cpp:5672] Status update TASK_KILLED (UUID:
> b28bbec8-5746-4f3a-bf53-2e4f1a78b4cb) for task
> ed09b29b-d149-46c1-a31d-f8fe3fdf214e of framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000 from agent
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-S0 at (401)@172.30.2.161:38776
> (ip-172-30-2-161.ec2.internal.mesosphere.io)
> I1007 14:12:58.022239 28651 slave.cpp:4026] Forwarding the update TASK_KILLED
> (UUID: a768cdc6-4d09-467c-8a66-f0117744a8dd) for task
> 9ce218ba-84b7-4ea4-a560-b30472f4a22c of framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000 to master@172.30.2.161:38776
> I1007 14:12:58.022256 28646 master.cpp:5734] Forwarding status update
> TASK_KILLED (UUID: b28bbec8-5746-4f3a-bf53-2e4f1a78b4cb) for task
> ed09b29b-d149-46c1-a31d-f8fe3fdf214e of framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000
> I1007 14:12:58.022279 28647 status_update_manager.cpp:285] Closing status
> update streams for framework 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000
> I1007 14:12:58.022294 28647 status_update_manager.cpp:531] Cleaning up status
> update stream for task 9ce218ba-84b7-4ea4-a560-b30472f4a22c of framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000
> I1007 14:12:58.022301 28651 slave.cpp:3920] Status update manager
> successfully handled status update TASK_KILLED (UUID:
> a768cdc6-4d09-467c-8a66-f0117744a8dd) for task
> 9ce218ba-84b7-4ea4-a560-b30472f4a22c of framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000
> I1007 14:12:58.022330 28647 status_update_manager.cpp:531] Cleaning up status
> update stream for task ed09b29b-d149-46c1-a31d-f8fe3fdf214e of framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000
> I1007 14:12:58.022347 28646 master.cpp:7592] Updating the state of task
> ed09b29b-d149-46c1-a31d-f8fe3fdf214e of framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000 (latest state: TASK_KILLED, status
> update state: TASK_KILLED)
> I1007 14:12:58.022449 28646 master.cpp:5672] Status update TASK_KILLED (UUID:
> a768cdc6-4d09-467c-8a66-f0117744a8dd) for task
> 9ce218ba-84b7-4ea4-a560-b30472f4a22c of framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000 from agent
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-S0 at (401)@172.30.2.161:38776
> (ip-172-30-2-161.ec2.internal.mesosphere.io)
> I1007 14:12:58.022464 28646 master.cpp:5734] Forwarding status update
> TASK_KILLED (UUID: a768cdc6-4d09-467c-8a66-f0117744a8dd) for task
> 9ce218ba-84b7-4ea4-a560-b30472f4a22c of framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000
> I1007 14:12:58.022461 28644 hierarchical.cpp:1018] Recovered cpus(*):0.1;
> mem(*):32; disk(*):32 (total: cpus(*):2; mem(*):1024; disk(*):1024;
> ports(*):[31000-32000], allocated: cpus(*):0.1; mem(*):32; disk(*):32) on
> agent 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-S0 from framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000
> I1007 14:12:58.022600 28650 scheduler.cpp:671] Enqueuing event UPDATE
> received from http://172.30.2.161:38776/master/api/v1/scheduler
> I1007 14:12:58.022609 28646 master.cpp:7592] Updating the state of task
> 9ce218ba-84b7-4ea4-a560-b30472f4a22c of framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000 (latest state: TASK_KILLED, status
> update state: TASK_KILLED)
> I1007 14:12:58.022825 28644 hierarchical.cpp:1018] Recovered cpus(*):0.1;
> mem(*):32; disk(*):32 (total: cpus(*):2; mem(*):1024; disk(*):1024;
> ports(*):[31000-32000], allocated: {}) on agent
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-S0 from framework
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb-0000
> I1007 14:12:58.022981 28645 scheduler.cpp:671] Enqueuing event UPDATE
> received from http://172.30.2.161:38776/master/api/v1/scheduler
> I1007 14:12:58.023365 28651 slave.cpp:787] Agent terminating
> *** Aborted at 1475849578 (unix time) try "date -d @1475849578" if you are
> using GNU date ***
> PC: @ 0x7fdbf26dc95f mesos::internal::slave::Slave::finalize()
> *** SIGSEGV (@0x88) received by PID 28630 (TID 0x7fdbe5910700) from PID 136;
> stack trace: ***
> @ 0x7fdbdef6fbb2 (unknown)
> @ 0x7fdbdef74009 (unknown)
> @ 0x7fdbdef67f08 (unknown)
> @ 0x7fdbf14bc9f0 (unknown)
> @ 0x7fdbf26dc95f mesos::internal::slave::Slave::finalize()
> @ 0x7fdbf2ec461b process::ProcessManager::resume()
> @ 0x7fdbf2ec7cb7
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7fdbf0dd6f20 (unknown)
> @ 0x7fdbf14b360a start_thread
> @ 0x7fdbf054659d __clone
> {code}