[
https://issues.apache.org/jira/browse/MESOS-7622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Park updated MESOS-7622:
--------------------------------
Target Version/s: 1.2.2 (was: 1.2.2, 1.3.1)
> Agent can crash if a HTTP executor tries to retry subscription in running
> state.
> --------------------------------------------------------------------------------
>
> Key: MESOS-7622
> URL: https://issues.apache.org/jira/browse/MESOS-7622
> Project: Mesos
> Issue Type: Bug
> Components: agent, executor
> Reporter: Aaron Wood
> Assignee: Anand Mazumdar
> Priority: Blocker
>
> It is possible that a running executor might retry its subscribe request.
> This can lead to a crash if it previously had any launched tasks. Note that
> the executor would still be able to subscribe again when the agent process
> restarts and is recovering.
> {code}
> sudo ./mesos-agent --master=10.0.2.15:5050 --work_dir=/tmp/slave
> --isolation=cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime
> --image_providers=docker --image_provisioner_backend=overlay
> --containerizers=mesos --launcher_dir=$(pwd)
> --executor_environment_variables='{"LD_LIBRARY_PATH":
> "/home/aaron/Code/src/mesos/build/src/.libs"}'
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0605 14:58:23.748180 10710 main.cpp:323] Build: 2017-06-02 17:09:05 UTC by
> aaron
> I0605 14:58:23.748252 10710 main.cpp:324] Version: 1.4.0
> I0605 14:58:23.755409 10710 systemd.cpp:238] systemd version `232` detected
> I0605 14:58:23.755450 10710 main.cpp:433] Initializing systemd state
> I0605 14:58:23.763049 10710 systemd.cpp:326] Started systemd slice
> `mesos_executors.slice`
> I0605 14:58:23.763777 10710 resolver.cpp:69] Creating default secret resolver
> I0605 14:58:23.764214 10710 containerizer.cpp:230] Using isolation:
> cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime,volume/image,environment_secret
> I0605 14:58:23.767192 10710 linux_launcher.cpp:150] Using
> /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> E0605 14:58:23.770179 10710 shell.hpp:107] Command 'hadoop version 2>&1'
> failed; this is the output:
> sh: 1: hadoop: not found
> I0605 14:58:23.770217 10710 fetcher.cpp:69] Skipping URI fetcher plugin
> 'hadoop' as it could not be created: Failed to create HDFS client: Failed to
> execute 'hadoop version 2>&1'; the command was either not found or exited
> with a non-zero exit status: 127
> I0605 14:58:23.770643 10710 provisioner.cpp:255] Using default backend
> 'overlay'
> I0605 14:58:23.785892 10710 slave.cpp:248] Mesos agent started on
> (1)@127.0.1.1:5051
> I0605 14:58:23.785957 10710 slave.cpp:249] Flags at startup:
> --appc_simple_discovery_uri_prefix="http://"
> --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false"
> --authenticate_http_readwrite="false" --authenticatee="crammd5"
> --authentication_backoff_factor="1secs" --authorizer="local"
> --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false"
> --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false"
> --cgroups_root="mesos" --container_disk_watch_interval="15secs"
> --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins"
> --docker="docker" --docker_kill_orphans="true"
> --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs"
> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns"
> --docker_store_dir="/tmp/mesos/store/docker"
> --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume"
> --enforce_container_disk_quota="false"
> --executor_environment_variables="{"LD_LIBRARY_PATH":"\/home\/aaron\/Code\/src\/mesos\/build\/src\/.libs"}"
> --executor_registration_timeout="1mins"
> --executor_reregistration_timeout="2secs"
> --executor_shutdown_grace_period="5secs"
> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB"
> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1"
> --hadoop_home="" --help="false" --hostname_lookup="true"
> --http_command_executor="false" --http_heartbeat_interval="30secs"
> --image_providers="docker" --image_provisioner_backend="overlay"
> --initialize_driver_logging="true"
> --isolation="cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime"
> --launcher="linux" --launcher_dir="/home/aaron/Code/src/mesos/build/src"
> --logbufsecs="0" --logging_level="INFO" --master="10.0.2.15:5050"
> --max_completed_executors_per_framework="150"
> --oversubscribed_resources_interval="15secs" --perf_duration="10secs"
> --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns"
> --quiet="false" --recover="reconnect" --recovery_timeout="15mins"
> --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true"
> --runtime_dir="/var/run/mesos" --sandbox_directory="/mnt/mesos/sandbox"
> --strict="true" --switch_user="true" --systemd_enable_support="true"
> --systemd_runtime_directory="/run/systemd/system" --version="false"
> --work_dir="/tmp/slave"
> I0605 14:58:23.786392 10710 slave.cpp:552] Agent resources: cpus(*):6;
> mem(*):6956; disk(*):41113; ports(*):[31000-32000]
> I0605 14:58:23.786437 10710 slave.cpp:560] Agent attributes: [ ]
> I0605 14:58:23.786468 10710 slave.cpp:565] Agent hostname: U64
> I0605 14:58:23.786574 10714 status_update_manager.cpp:177] Pausing sending
> status updates
> I0605 14:58:23.787470 10718 state.cpp:62] Recovering state from
> '/tmp/slave/meta'
> I0605 14:58:23.787698 10713 status_update_manager.cpp:203] Recovering status
> update manager
> I0605 14:58:23.787755 10713 containerizer.cpp:582] Recovering containerizer
> I0605 14:58:23.788499 10714 linux_launcher.cpp:291] Recovered container
> 0fbfe58f-a1bf-4128-83ed-81c2901299b7
> I0605 14:58:23.788805 10714 linux_launcher.cpp:373]
> 0fbfe58f-a1bf-4128-83ed-81c2901299b7 is a known orphaned container
> I0605 14:58:23.790591 10714 memory.cpp:479] Started listening for OOM events
> for container 0fbfe58f-a1bf-4128-83ed-81c2901299b7
> I0605 14:58:23.791183 10714 memory.cpp:590] Started listening on 'low' memory
> pressure events for container 0fbfe58f-a1bf-4128-83ed-81c2901299b7
> I0605 14:58:23.791714 10714 memory.cpp:590] Started listening on 'medium'
> memory pressure events for container 0fbfe58f-a1bf-4128-83ed-81c2901299b7
> I0605 14:58:23.792260 10714 memory.cpp:590] Started listening on 'critical'
> memory pressure events for container 0fbfe58f-a1bf-4128-83ed-81c2901299b7
> I0605 14:58:23.792899 10717 provisioner.cpp:372] Recovered container
> 0fbfe58f-a1bf-4128-83ed-81c2901299b7
> I0605 14:58:23.793344 10717 metadata_manager.cpp:236] Successfully loaded 1
> Docker images
> I0605 14:58:23.793432 10717 provisioner.cpp:416] Provisioner recovery complete
> I0605 14:58:23.793514 10717 containerizer.cpp:939] Cleaning up orphan
> container 0fbfe58f-a1bf-4128-83ed-81c2901299b7
> I0605 14:58:23.793536 10717 containerizer.cpp:2101] Destroying container
> 0fbfe58f-a1bf-4128-83ed-81c2901299b7 in RUNNING state
> I0605 14:58:23.793669 10717 slave.cpp:6127] Finished recovery
> I0605 14:58:23.793853 10711 linux_launcher.cpp:505] Asked to destroy
> container 0fbfe58f-a1bf-4128-83ed-81c2901299b7
> I0605 14:58:23.794039 10717 slave.cpp:945] New master detected at
> [email protected]:5050
> I0605 14:58:23.794100 10717 slave.cpp:969] No credentials provided.
> Attempting to register without authentication
> I0605 14:58:23.794109 10717 slave.cpp:980] Detecting new master
> I0605 14:58:23.794131 10717 status_update_manager.cpp:177] Pausing sending
> status updates
> I0605 14:58:23.794304 10711 linux_launcher.cpp:548] Using freezer to destroy
> cgroup mesos/0fbfe58f-a1bf-4128-83ed-81c2901299b7
> I0605 14:58:23.794937 10717 cgroups.cpp:2692] Freezing cgroup
> /sys/fs/cgroup/freezer/mesos/0fbfe58f-a1bf-4128-83ed-81c2901299b7
> I0605 14:58:23.795821 10711 cgroups.cpp:1405] Successfully froze cgroup
> /sys/fs/cgroup/freezer/mesos/0fbfe58f-a1bf-4128-83ed-81c2901299b7 after
> 850944ns
> I0605 14:58:23.813447 10714 cgroups.cpp:2710] Thawing cgroup
> /sys/fs/cgroup/freezer/mesos/0fbfe58f-a1bf-4128-83ed-81c2901299b7
> I0605 14:58:23.816241 10714 cgroups.cpp:1434] Successfully thawed cgroup
> /sys/fs/cgroup/freezer/mesos/0fbfe58f-a1bf-4128-83ed-81c2901299b7 after
> 2.750976ms
> I0605 14:58:23.851388 10711 containerizer.cpp:2507] Container
> 0fbfe58f-a1bf-4128-83ed-81c2901299b7 has exited
> W0605 14:58:23.853590 10712 disk.cpp:448] Ignoring cleanup for unknown
> container 0fbfe58f-a1bf-4128-83ed-81c2901299b7
> I0605 14:58:23.856833 10711 provisioner.cpp:576] Destroying container rootfs
> at
> '/tmp/slave/provisioner/containers/0fbfe58f-a1bf-4128-83ed-81c2901299b7/backends/overlay/rootfses/ade95f11-60ec-47ee-9e4e-bcbc63cc5eaa'
> for container 0fbfe58f-a1bf-4128-83ed-81c2901299b7
> I0605 14:58:24.572808 10712 slave.cpp:1148] Registered with master
> [email protected]:5050; given agent ID eb661932-585d-453a-bc64-a720e05ce634-S0
> I0605 14:58:24.572888 10711 status_update_manager.cpp:184] Resuming sending
> status updates
> I0605 14:58:24.573446 10712 slave.cpp:1206] Forwarding total oversubscribed
> resources {}
> I0605 14:58:24.579311 10711 slave.cpp:1632] Got assigned task
> 'testapp-cc6e64001fee44e3a20d7a15149d8b34' for framework
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001
> I0605 14:58:24.580113 10711 slave.cpp:1913] Authorizing task
> 'testapp-cc6e64001fee44e3a20d7a15149d8b34' for framework
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001
> I0605 14:58:24.580293 10711 slave.cpp:2100] Launching task
> 'testapp-cc6e64001fee44e3a20d7a15149d8b34' for framework
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001
> I0605 14:58:24.580703 10711 paths.cpp:573] Trying to chown
> '/tmp/slave/slaves/eb661932-585d-453a-bc64-a720e05ce634-S0/frameworks/b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001/executors/testapp-cc6e64001fee44e3a20d7a15149d8b34/runs/cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a'
> to user 'root'
> I0605 14:58:24.581671 10711 slave.cpp:7086] Launching executor
> 'testapp-cc6e64001fee44e3a20d7a15149d8b34' of framework
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 with resources cpus(*)(allocated:
> *):0.1; mem(*)(allocated: *):32 in work directory
> '/tmp/slave/slaves/eb661932-585d-453a-bc64-a720e05ce634-S0/frameworks/b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001/executors/testapp-cc6e64001fee44e3a20d7a15149d8b34/runs/cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a'
> I0605 14:58:24.581881 10711 slave.cpp:2795] Launching container
> cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a for executor
> 'testapp-cc6e64001fee44e3a20d7a15149d8b34' of framework
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001
> I0605 14:58:24.582072 10712 containerizer.cpp:1056] Starting container
> cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a
> I0605 14:58:24.582391 10711 slave.cpp:2329] Queued task
> 'testapp-cc6e64001fee44e3a20d7a15149d8b34' for executor
> 'testapp-cc6e64001fee44e3a20d7a15149d8b34' of framework
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001
> I0605 14:58:24.583070 10712 provisioner.cpp:459] Provisioning image rootfs
> '/tmp/slave/provisioner/containers/cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a/backends/overlay/rootfses/736ac735-51b1-4bc9-892b-15330d0784f9'
> for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a using overlay backend
> I0605 14:58:24.587167 10715 memory.cpp:479] Started listening for OOM events
> for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a
> I0605 14:58:24.587508 10715 memory.cpp:590] Started listening on 'low' memory
> pressure events for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a
> I0605 14:58:24.587805 10715 memory.cpp:590] Started listening on 'medium'
> memory pressure events for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a
> I0605 14:58:24.588145 10715 memory.cpp:590] Started listening on 'critical'
> memory pressure events for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a
> I0605 14:58:24.588714 10715 memory.cpp:199] Updated
> 'memory.soft_limit_in_bytes' to 64MB for container
> cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a
> I0605 14:58:24.589285 10715 memory.cpp:228] Updated 'memory.limit_in_bytes'
> to 64MB for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a
> I0605 14:58:24.589588 10715 cpu.cpp:101] Updated 'cpu.shares' to 204 (cpus
> 0.2) for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a
> I0605 14:58:24.591984 10711 linux_launcher.cpp:429] Launching container
> cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a and cloning with namespaces CLONE_NEWNS
> I0605 14:58:24.593714 10711 systemd.cpp:96] Assigned child process '10736' to
> 'mesos_executors.slice'
> I0605 14:58:24.595461 10712 containerizer.cpp:1722] Checkpointing container's
> forked pid 10736 to
> '/tmp/slave/meta/slaves/eb661932-585d-453a-bc64-a720e05ce634-S0/frameworks/b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001/executors/testapp-cc6e64001fee44e3a20d7a15149d8b34/runs/cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a/pids/forked.pid'
> I0605 14:58:25.247808 10718 slave.cpp:3825] Got registration for executor
> 'testapp-cc6e64001fee44e3a20d7a15149d8b34' of framework
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 from executor(1)@127.0.1.1:42459
> I0605 14:58:25.248884 10715 disk.cpp:208] Updating the disk resources for
> container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a to cpus(*)(allocated: *):0.2;
> mem(*)(allocated: *):64; disk(*)(allocated: *):1024
> I0605 14:58:25.249397 10716 memory.cpp:199] Updated
> 'memory.soft_limit_in_bytes' to 64MB for container
> cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a
> I0605 14:58:25.250066 10716 cpu.cpp:101] Updated 'cpu.shares' to 204 (cpus
> 0.2) for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a
> I0605 14:58:25.250365 10715 slave.cpp:2542] Sending queued task
> 'testapp-cc6e64001fee44e3a20d7a15149d8b34' to executor
> 'testapp-cc6e64001fee44e3a20d7a15149d8b34' of framework
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 at executor(1)@127.0.1.1:42459
> I0605 14:58:25.253556 10716 slave.cpp:4295] Handling status update
> TASK_RUNNING (UUID: 0292182c-35e6-4df8-8583-fe11f726c211) for task
> testapp-cc6e64001fee44e3a20d7a15149d8b34 of framework
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 from executor(1)@127.0.1.1:42459
> I0605 14:58:25.254091 10714 status_update_manager.cpp:323] Received status
> update TASK_RUNNING (UUID: 0292182c-35e6-4df8-8583-fe11f726c211) for task
> testapp-cc6e64001fee44e3a20d7a15149d8b34 of framework
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001
> I0605 14:58:25.254287 10714 status_update_manager.cpp:834] Checkpointing
> UPDATE for status update TASK_RUNNING (UUID:
> 0292182c-35e6-4df8-8583-fe11f726c211) for task
> testapp-cc6e64001fee44e3a20d7a15149d8b34 of framework
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001
> I0605 14:58:25.254456 10714 slave.cpp:4735] Forwarding the update
> TASK_RUNNING (UUID: 0292182c-35e6-4df8-8583-fe11f726c211) for task
> testapp-cc6e64001fee44e3a20d7a15149d8b34 of framework
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 to [email protected]:5050
> I0605 14:58:25.254534 10714 slave.cpp:4645] Sending acknowledgement for
> status update TASK_RUNNING (UUID: 0292182c-35e6-4df8-8583-fe11f726c211) for
> task testapp-cc6e64001fee44e3a20d7a15149d8b34 of framework
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 to executor(1)@127.0.1.1:42459
> I0605 14:58:25.260128 10712 status_update_manager.cpp:395] Received status
> update acknowledgement (UUID: 0292182c-35e6-4df8-8583-fe11f726c211) for task
> testapp-cc6e64001fee44e3a20d7a15149d8b34 of framework
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001
> I0605 14:58:25.260184 10712 status_update_manager.cpp:834] Checkpointing ACK
> for status update TASK_RUNNING (UUID: 0292182c-35e6-4df8-8583-fe11f726c211)
> for task testapp-cc6e64001fee44e3a20d7a15149d8b34 of framework
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001
> I0605 14:58:25.352159 10712 http.cpp:1115] HTTP POST for
> /slave(1)/api/v1/executor from 10.0.2.15:55512 with
> User-Agent='mesos-framework-sdk'
> I0605 14:58:25.352342 10712 slave.cpp:3609] Received Subscribe request for
> HTTP executor 'testapp-cc6e64001fee44e3a20d7a15149d8b34' of framework
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 at executor(1)@127.0.1.1:42459
> I0605 14:58:25.352388 10712 slave.cpp:3672] Creating a marker file for HTTP
> based executor 'testapp-cc6e64001fee44e3a20d7a15149d8b34' of framework
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 (via HTTP) at path
> '/tmp/slave/meta/slaves/eb661932-585d-453a-bc64-a720e05ce634-S0/frameworks/b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001/executors/testapp-cc6e64001fee44e3a20d7a15149d8b34/runs/cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a/http.marker'
> F0605 14:58:25.352444 10712 slave.cpp:3685] Check failed:
> executor->launchedTasks.empty() Newly registered executor
> 'testapp-cc6e64001fee44e3a20d7a15149d8b34' has launched tasks
> *** Check failure stack trace: ***
> @ 0x7fb835fe30ed google::LogMessage::Fail()
> @ 0x7fb835fe4ef0 google::LogMessage::SendToLog()
> @ 0x7fb835fe2cb3 google::LogMessage::Flush()
> @ 0x7fb835fe5939 google::LogMessageFatal::~LogMessageFatal()
> @ 0x7fb8346e126c mesos::internal::slave::Slave::subscribe()
> @ 0x7fb8345f3169 mesos::internal::slave::Http::executor()
> @ 0x7fb8346b1107
> _ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEERKNS2_7RequestERK6OptionINS2_14authentication9PrincipalEEEZN5mesos8internal5slave5Slave10initializeEvEUlS7_SD_E1_E9_M_invokeERKSt9_Any_dataS7_SD_
> @ 0x7fb833964a10
> _ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEEvEZZNKS0_9_DeferredIZZNS0_11ProcessBase6_visitERKNS7_12HttpEndpointERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS0_5OwnedINS2_7RequestEEEENKUlRK6OptionINS2_14authentication20AuthenticationResultEEE0_clEST_EUlbE1_EcvSt8functionIFT_T0_EEIS4_RKbEEvENKUlS14_E_clES14_EUlvE_E9_M_invokeERKSt9_Any_data
> @ 0x7fb83465fffe
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8internal8DispatchINS0_6FutureINS0_4http8ResponseEEEEclIRSt8functionIFS9_vEEEES9_RKNS0_4UPIDEOT_EUlS2_E_E9_M_invokeERKSt9_Any_dataOS2_
> @ 0x7fb8339607ec process::ProcessBase::visit()
> @ 0x7fb83396c03b process::ProcessManager::resume()
> @ 0x7fb8339755bf
> _ZNSt6thread11_State_implISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7fb832a3e83f (unknown)
> @ 0x7fb83596d6da start_thread
> @ 0x7fb8324ad17f (unknown)
> Aborted (core dumped)
> {code}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)