[ 
https://issues.apache.org/jira/browse/MESOS-7622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-7622:
--------------------------------
    Target Version/s: 1.2.2  (was: 1.2.2, 1.3.1)

> Agent can crash if a HTTP executor tries to retry subscription in running 
> state.
> --------------------------------------------------------------------------------
>
>                 Key: MESOS-7622
>                 URL: https://issues.apache.org/jira/browse/MESOS-7622
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, executor
>            Reporter: Aaron Wood
>            Assignee: Anand Mazumdar
>            Priority: Blocker
>
> It is possible that a running executor might retry its subscribe request. 
> This can lead to a crash if it previously had any launched tasks. Note that 
> the executor would still be able to subscribe again when the agent process 
> restarts and is recovering.
> {code}
> sudo ./mesos-agent --master=10.0.2.15:5050 --work_dir=/tmp/slave 
> --isolation=cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime
>  --image_providers=docker --image_provisioner_backend=overlay 
> --containerizers=mesos --launcher_dir=$(pwd) 
> --executor_environment_variables='{"LD_LIBRARY_PATH": 
> "/home/aaron/Code/src/mesos/build/src/.libs"}'
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0605 14:58:23.748180 10710 main.cpp:323] Build: 2017-06-02 17:09:05 UTC by 
> aaron
> I0605 14:58:23.748252 10710 main.cpp:324] Version: 1.4.0
> I0605 14:58:23.755409 10710 systemd.cpp:238] systemd version `232` detected
> I0605 14:58:23.755450 10710 main.cpp:433] Initializing systemd state
> I0605 14:58:23.763049 10710 systemd.cpp:326] Started systemd slice 
> `mesos_executors.slice`
> I0605 14:58:23.763777 10710 resolver.cpp:69] Creating default secret resolver
> I0605 14:58:23.764214 10710 containerizer.cpp:230] Using isolation: 
> cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime,volume/image,environment_secret
> I0605 14:58:23.767192 10710 linux_launcher.cpp:150] Using 
> /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> E0605 14:58:23.770179 10710 shell.hpp:107] Command 'hadoop version 2>&1' 
> failed; this is the output:
> sh: 1: hadoop: not found
> I0605 14:58:23.770217 10710 fetcher.cpp:69] Skipping URI fetcher plugin 
> 'hadoop' as it could not be created: Failed to create HDFS client: Failed to 
> execute 'hadoop version 2>&1'; the command was either not found or exited 
> with a non-zero exit status: 127
> I0605 14:58:23.770643 10710 provisioner.cpp:255] Using default backend 
> 'overlay'
> I0605 14:58:23.785892 10710 slave.cpp:248] Mesos agent started on 
> (1)@127.0.1.1:5051
> I0605 14:58:23.785957 10710 slave.cpp:249] Flags at startup: 
> --appc_simple_discovery_uri_prefix="http://"; 
> --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticatee="crammd5" 
> --authentication_backoff_factor="1secs" --authorizer="local" 
> --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
> --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
> --cgroups_root="mesos" --container_disk_watch_interval="15secs" 
> --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" 
> --docker="docker" --docker_kill_orphans="true" 
> --docker_registry="https://registry-1.docker.io"; --docker_remove_delay="6hrs" 
> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" 
> --docker_store_dir="/tmp/mesos/store/docker" 
> --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
> --enforce_container_disk_quota="false" 
> --executor_environment_variables="{"LD_LIBRARY_PATH":"\/home\/aaron\/Code\/src\/mesos\/build\/src\/.libs"}"
>  --executor_registration_timeout="1mins" 
> --executor_reregistration_timeout="2secs" 
> --executor_shutdown_grace_period="5secs" 
> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" 
> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" 
> --hadoop_home="" --help="false" --hostname_lookup="true" 
> --http_command_executor="false" --http_heartbeat_interval="30secs" 
> --image_providers="docker" --image_provisioner_backend="overlay" 
> --initialize_driver_logging="true" 
> --isolation="cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime"
>  --launcher="linux" --launcher_dir="/home/aaron/Code/src/mesos/build/src" 
> --logbufsecs="0" --logging_level="INFO" --master="10.0.2.15:5050" 
> --max_completed_executors_per_framework="150" 
> --oversubscribed_resources_interval="15secs" --perf_duration="10secs" 
> --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" 
> --quiet="false" --recover="reconnect" --recovery_timeout="15mins" 
> --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" 
> --runtime_dir="/var/run/mesos" --sandbox_directory="/mnt/mesos/sandbox" 
> --strict="true" --switch_user="true" --systemd_enable_support="true" 
> --systemd_runtime_directory="/run/systemd/system" --version="false" 
> --work_dir="/tmp/slave"
> I0605 14:58:23.786392 10710 slave.cpp:552] Agent resources: cpus(*):6; 
> mem(*):6956; disk(*):41113; ports(*):[31000-32000]
> I0605 14:58:23.786437 10710 slave.cpp:560] Agent attributes: [  ]
> I0605 14:58:23.786468 10710 slave.cpp:565] Agent hostname: U64
> I0605 14:58:23.786574 10714 status_update_manager.cpp:177] Pausing sending 
> status updates
> I0605 14:58:23.787470 10718 state.cpp:62] Recovering state from 
> '/tmp/slave/meta'
> I0605 14:58:23.787698 10713 status_update_manager.cpp:203] Recovering status 
> update manager
> I0605 14:58:23.787755 10713 containerizer.cpp:582] Recovering containerizer
> I0605 14:58:23.788499 10714 linux_launcher.cpp:291] Recovered container 
> 0fbfe58f-a1bf-4128-83ed-81c2901299b7
> I0605 14:58:23.788805 10714 linux_launcher.cpp:373] 
> 0fbfe58f-a1bf-4128-83ed-81c2901299b7 is a known orphaned container
> I0605 14:58:23.790591 10714 memory.cpp:479] Started listening for OOM events 
> for container 0fbfe58f-a1bf-4128-83ed-81c2901299b7
> I0605 14:58:23.791183 10714 memory.cpp:590] Started listening on 'low' memory 
> pressure events for container 0fbfe58f-a1bf-4128-83ed-81c2901299b7
> I0605 14:58:23.791714 10714 memory.cpp:590] Started listening on 'medium' 
> memory pressure events for container 0fbfe58f-a1bf-4128-83ed-81c2901299b7
> I0605 14:58:23.792260 10714 memory.cpp:590] Started listening on 'critical' 
> memory pressure events for container 0fbfe58f-a1bf-4128-83ed-81c2901299b7
> I0605 14:58:23.792899 10717 provisioner.cpp:372] Recovered container 
> 0fbfe58f-a1bf-4128-83ed-81c2901299b7
> I0605 14:58:23.793344 10717 metadata_manager.cpp:236] Successfully loaded 1 
> Docker images
> I0605 14:58:23.793432 10717 provisioner.cpp:416] Provisioner recovery complete
> I0605 14:58:23.793514 10717 containerizer.cpp:939] Cleaning up orphan 
> container 0fbfe58f-a1bf-4128-83ed-81c2901299b7
> I0605 14:58:23.793536 10717 containerizer.cpp:2101] Destroying container 
> 0fbfe58f-a1bf-4128-83ed-81c2901299b7 in RUNNING state
> I0605 14:58:23.793669 10717 slave.cpp:6127] Finished recovery
> I0605 14:58:23.793853 10711 linux_launcher.cpp:505] Asked to destroy 
> container 0fbfe58f-a1bf-4128-83ed-81c2901299b7
> I0605 14:58:23.794039 10717 slave.cpp:945] New master detected at 
> master@10.0.2.15:5050
> I0605 14:58:23.794100 10717 slave.cpp:969] No credentials provided. 
> Attempting to register without authentication
> I0605 14:58:23.794109 10717 slave.cpp:980] Detecting new master
> I0605 14:58:23.794131 10717 status_update_manager.cpp:177] Pausing sending 
> status updates
> I0605 14:58:23.794304 10711 linux_launcher.cpp:548] Using freezer to destroy 
> cgroup mesos/0fbfe58f-a1bf-4128-83ed-81c2901299b7
> I0605 14:58:23.794937 10717 cgroups.cpp:2692] Freezing cgroup 
> /sys/fs/cgroup/freezer/mesos/0fbfe58f-a1bf-4128-83ed-81c2901299b7
> I0605 14:58:23.795821 10711 cgroups.cpp:1405] Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/0fbfe58f-a1bf-4128-83ed-81c2901299b7 after 
> 850944ns
> I0605 14:58:23.813447 10714 cgroups.cpp:2710] Thawing cgroup 
> /sys/fs/cgroup/freezer/mesos/0fbfe58f-a1bf-4128-83ed-81c2901299b7
> I0605 14:58:23.816241 10714 cgroups.cpp:1434] Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/0fbfe58f-a1bf-4128-83ed-81c2901299b7 after 
> 2.750976ms
> I0605 14:58:23.851388 10711 containerizer.cpp:2507] Container 
> 0fbfe58f-a1bf-4128-83ed-81c2901299b7 has exited
> W0605 14:58:23.853590 10712 disk.cpp:448] Ignoring cleanup for unknown 
> container 0fbfe58f-a1bf-4128-83ed-81c2901299b7
> I0605 14:58:23.856833 10711 provisioner.cpp:576] Destroying container rootfs 
> at 
> '/tmp/slave/provisioner/containers/0fbfe58f-a1bf-4128-83ed-81c2901299b7/backends/overlay/rootfses/ade95f11-60ec-47ee-9e4e-bcbc63cc5eaa'
>  for container 0fbfe58f-a1bf-4128-83ed-81c2901299b7
> I0605 14:58:24.572808 10712 slave.cpp:1148] Registered with master 
> master@10.0.2.15:5050; given agent ID eb661932-585d-453a-bc64-a720e05ce634-S0
> I0605 14:58:24.572888 10711 status_update_manager.cpp:184] Resuming sending 
> status updates
> I0605 14:58:24.573446 10712 slave.cpp:1206] Forwarding total oversubscribed 
> resources {}
> I0605 14:58:24.579311 10711 slave.cpp:1632] Got assigned task 
> 'testapp-cc6e64001fee44e3a20d7a15149d8b34' for framework 
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001
> I0605 14:58:24.580113 10711 slave.cpp:1913] Authorizing task 
> 'testapp-cc6e64001fee44e3a20d7a15149d8b34' for framework 
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001
> I0605 14:58:24.580293 10711 slave.cpp:2100] Launching task 
> 'testapp-cc6e64001fee44e3a20d7a15149d8b34' for framework 
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001
> I0605 14:58:24.580703 10711 paths.cpp:573] Trying to chown 
> '/tmp/slave/slaves/eb661932-585d-453a-bc64-a720e05ce634-S0/frameworks/b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001/executors/testapp-cc6e64001fee44e3a20d7a15149d8b34/runs/cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a'
>  to user 'root'
> I0605 14:58:24.581671 10711 slave.cpp:7086] Launching executor 
> 'testapp-cc6e64001fee44e3a20d7a15149d8b34' of framework 
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 with resources cpus(*)(allocated: 
> *):0.1; mem(*)(allocated: *):32 in work directory 
> '/tmp/slave/slaves/eb661932-585d-453a-bc64-a720e05ce634-S0/frameworks/b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001/executors/testapp-cc6e64001fee44e3a20d7a15149d8b34/runs/cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a'
> I0605 14:58:24.581881 10711 slave.cpp:2795] Launching container 
> cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a for executor 
> 'testapp-cc6e64001fee44e3a20d7a15149d8b34' of framework 
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001
> I0605 14:58:24.582072 10712 containerizer.cpp:1056] Starting container 
> cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a
> I0605 14:58:24.582391 10711 slave.cpp:2329] Queued task 
> 'testapp-cc6e64001fee44e3a20d7a15149d8b34' for executor 
> 'testapp-cc6e64001fee44e3a20d7a15149d8b34' of framework 
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001
> I0605 14:58:24.583070 10712 provisioner.cpp:459] Provisioning image rootfs 
> '/tmp/slave/provisioner/containers/cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a/backends/overlay/rootfses/736ac735-51b1-4bc9-892b-15330d0784f9'
>  for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a using overlay backend
> I0605 14:58:24.587167 10715 memory.cpp:479] Started listening for OOM events 
> for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a
> I0605 14:58:24.587508 10715 memory.cpp:590] Started listening on 'low' memory 
> pressure events for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a
> I0605 14:58:24.587805 10715 memory.cpp:590] Started listening on 'medium' 
> memory pressure events for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a
> I0605 14:58:24.588145 10715 memory.cpp:590] Started listening on 'critical' 
> memory pressure events for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a
> I0605 14:58:24.588714 10715 memory.cpp:199] Updated 
> 'memory.soft_limit_in_bytes' to 64MB for container 
> cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a
> I0605 14:58:24.589285 10715 memory.cpp:228] Updated 'memory.limit_in_bytes' 
> to 64MB for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a
> I0605 14:58:24.589588 10715 cpu.cpp:101] Updated 'cpu.shares' to 204 (cpus 
> 0.2) for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a
> I0605 14:58:24.591984 10711 linux_launcher.cpp:429] Launching container 
> cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a and cloning with namespaces CLONE_NEWNS
> I0605 14:58:24.593714 10711 systemd.cpp:96] Assigned child process '10736' to 
> 'mesos_executors.slice'
> I0605 14:58:24.595461 10712 containerizer.cpp:1722] Checkpointing container's 
> forked pid 10736 to 
> '/tmp/slave/meta/slaves/eb661932-585d-453a-bc64-a720e05ce634-S0/frameworks/b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001/executors/testapp-cc6e64001fee44e3a20d7a15149d8b34/runs/cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a/pids/forked.pid'
> I0605 14:58:25.247808 10718 slave.cpp:3825] Got registration for executor 
> 'testapp-cc6e64001fee44e3a20d7a15149d8b34' of framework 
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 from executor(1)@127.0.1.1:42459
> I0605 14:58:25.248884 10715 disk.cpp:208] Updating the disk resources for 
> container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a to cpus(*)(allocated: *):0.2; 
> mem(*)(allocated: *):64; disk(*)(allocated: *):1024
> I0605 14:58:25.249397 10716 memory.cpp:199] Updated 
> 'memory.soft_limit_in_bytes' to 64MB for container 
> cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a
> I0605 14:58:25.250066 10716 cpu.cpp:101] Updated 'cpu.shares' to 204 (cpus 
> 0.2) for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a
> I0605 14:58:25.250365 10715 slave.cpp:2542] Sending queued task 
> 'testapp-cc6e64001fee44e3a20d7a15149d8b34' to executor 
> 'testapp-cc6e64001fee44e3a20d7a15149d8b34' of framework 
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 at executor(1)@127.0.1.1:42459
> I0605 14:58:25.253556 10716 slave.cpp:4295] Handling status update 
> TASK_RUNNING (UUID: 0292182c-35e6-4df8-8583-fe11f726c211) for task 
> testapp-cc6e64001fee44e3a20d7a15149d8b34 of framework 
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 from executor(1)@127.0.1.1:42459
> I0605 14:58:25.254091 10714 status_update_manager.cpp:323] Received status 
> update TASK_RUNNING (UUID: 0292182c-35e6-4df8-8583-fe11f726c211) for task 
> testapp-cc6e64001fee44e3a20d7a15149d8b34 of framework 
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001
> I0605 14:58:25.254287 10714 status_update_manager.cpp:834] Checkpointing 
> UPDATE for status update TASK_RUNNING (UUID: 
> 0292182c-35e6-4df8-8583-fe11f726c211) for task 
> testapp-cc6e64001fee44e3a20d7a15149d8b34 of framework 
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001
> I0605 14:58:25.254456 10714 slave.cpp:4735] Forwarding the update 
> TASK_RUNNING (UUID: 0292182c-35e6-4df8-8583-fe11f726c211) for task 
> testapp-cc6e64001fee44e3a20d7a15149d8b34 of framework 
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 to master@10.0.2.15:5050
> I0605 14:58:25.254534 10714 slave.cpp:4645] Sending acknowledgement for 
> status update TASK_RUNNING (UUID: 0292182c-35e6-4df8-8583-fe11f726c211) for 
> task testapp-cc6e64001fee44e3a20d7a15149d8b34 of framework 
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 to executor(1)@127.0.1.1:42459
> I0605 14:58:25.260128 10712 status_update_manager.cpp:395] Received status 
> update acknowledgement (UUID: 0292182c-35e6-4df8-8583-fe11f726c211) for task 
> testapp-cc6e64001fee44e3a20d7a15149d8b34 of framework 
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001
> I0605 14:58:25.260184 10712 status_update_manager.cpp:834] Checkpointing ACK 
> for status update TASK_RUNNING (UUID: 0292182c-35e6-4df8-8583-fe11f726c211) 
> for task testapp-cc6e64001fee44e3a20d7a15149d8b34 of framework 
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001
> I0605 14:58:25.352159 10712 http.cpp:1115] HTTP POST for 
> /slave(1)/api/v1/executor from 10.0.2.15:55512 with 
> User-Agent='mesos-framework-sdk'
> I0605 14:58:25.352342 10712 slave.cpp:3609] Received Subscribe request for 
> HTTP executor 'testapp-cc6e64001fee44e3a20d7a15149d8b34' of framework 
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 at executor(1)@127.0.1.1:42459
> I0605 14:58:25.352388 10712 slave.cpp:3672] Creating a marker file for HTTP 
> based executor 'testapp-cc6e64001fee44e3a20d7a15149d8b34' of framework 
> b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 (via HTTP) at path 
> '/tmp/slave/meta/slaves/eb661932-585d-453a-bc64-a720e05ce634-S0/frameworks/b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001/executors/testapp-cc6e64001fee44e3a20d7a15149d8b34/runs/cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a/http.marker'
> F0605 14:58:25.352444 10712 slave.cpp:3685] Check failed: 
> executor->launchedTasks.empty()  Newly registered executor 
> 'testapp-cc6e64001fee44e3a20d7a15149d8b34' has launched tasks
> *** Check failure stack trace: ***
>     @     0x7fb835fe30ed  google::LogMessage::Fail()
>     @     0x7fb835fe4ef0  google::LogMessage::SendToLog()
>     @     0x7fb835fe2cb3  google::LogMessage::Flush()
>     @     0x7fb835fe5939  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7fb8346e126c  mesos::internal::slave::Slave::subscribe()
>     @     0x7fb8345f3169  mesos::internal::slave::Http::executor()
>     @     0x7fb8346b1107  
> _ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEERKNS2_7RequestERK6OptionINS2_14authentication9PrincipalEEEZN5mesos8internal5slave5Slave10initializeEvEUlS7_SD_E1_E9_M_invokeERKSt9_Any_dataS7_SD_
>     @     0x7fb833964a10  
> _ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEEvEZZNKS0_9_DeferredIZZNS0_11ProcessBase6_visitERKNS7_12HttpEndpointERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS0_5OwnedINS2_7RequestEEEENKUlRK6OptionINS2_14authentication20AuthenticationResultEEE0_clEST_EUlbE1_EcvSt8functionIFT_T0_EEIS4_RKbEEvENKUlS14_E_clES14_EUlvE_E9_M_invokeERKSt9_Any_data
>     @     0x7fb83465fffe  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8internal8DispatchINS0_6FutureINS0_4http8ResponseEEEEclIRSt8functionIFS9_vEEEES9_RKNS0_4UPIDEOT_EUlS2_E_E9_M_invokeERKSt9_Any_dataOS2_
>     @     0x7fb8339607ec  process::ProcessBase::visit()
>     @     0x7fb83396c03b  process::ProcessManager::resume()
>     @     0x7fb8339755bf  
> _ZNSt6thread11_State_implISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
>     @     0x7fb832a3e83f  (unknown)
>     @     0x7fb83596d6da  start_thread
>     @     0x7fb8324ad17f  (unknown)
> Aborted (core dumped)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to