[
https://issues.apache.org/jira/browse/MESOS-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16178496#comment-16178496
]
Bill Green commented on MESOS-7130:
-----------------------------------
I just ran into this exact same problem using the {{network/port_mapping}} isolator
compiled into a DCOS build (Mesos 1.2.2, DCOS 1.9.2). The affected agents are
running on bare-metal CoreOS 1298.7.0.
We use bonding on these hosts, so our interface name is bond0.
I noticed that inside the container's network namespace, the MTU of the bond0
interface showed as 1500, which is different from the host's bond0, where we use an
MTU of 9000.
After I changed the host's bond0 MTU to 1500, the {{network/port_mapping}} isolator
behaved as expected.
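For anyone hitting the same thing, this is roughly how I compared the two sides (a
sketch rather than my exact session; the interface name and the namespace handle
under /run/netns will differ on your hosts):
{noformat}
# Host side: bond0 runs with our jumbo-frame MTU.
ip link show bond0 | grep -o 'mtu [0-9]*'             # -> mtu 9000

# Container side: list the namespace handles the isolator bind-mounts under
# /run/netns and inspect the mirrored interface inside one of them.
ip netns list
ip netns exec <container-netns> ip link show bond0    # -> mtu 1500 here

# Workaround that made the isolator behave for us: make the host match.
ip link set dev bond0 mtu 1500
{noformat}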
It seems the isolator assumes an MTU of 1500 when it mirrors the interface into the
container's namespace, and that mismatch breaks path MTU discovery.
It wouldn't surprise me if the MTU on EC2 hosts were something other than 1500; the
{{mtu 9001}} in the quoted log below is in fact the jumbo-frame MTU EC2 commonly uses.
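If you want to confirm that oversized packets are being black-holed rather than
fragmented, forcing the DF bit makes it obvious (again just a sketch;
{{<destination>}} is a placeholder for any host reachable from the agent):
{noformat}
# 1472 = 1500 minus 28 bytes of IP + ICMP headers; 8972 = 9000 minus 28.
ping -M do -c 3 -s 1472 <destination>   # should get replies
ping -M do -c 3 -s 8972 <destination>   # no replies here points at an MTU/PMTUD problem
tracepath <destination>                 # reports the path MTU it actually discovers
{noformat}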
> port_mapping isolator: executor hangs when running on EC2
> ---------------------------------------------------------
>
> Key: MESOS-7130
> URL: https://issues.apache.org/jira/browse/MESOS-7130
> Project: Mesos
> Issue Type: Bug
> Components: executor
> Reporter: Pierre Cheynier
>
> Hi,
> I'm experiencing a weird issue: I'm using a CI pipeline to test our
> infrastructure automation.
> I recently activated the {{network/port_mapping}} isolator.
> I'm able to make the changes work and pass the tests on bare-metal servers
> and VirtualBox VMs using this configuration.
> But when I try on EC2 (which my CI pipeline relies on), it systematically fails
> to run any container.
> It appears that the sandbox is created and the port_mapping isolator seems to
> be OK according to the logs in stdout and stderr and the {{tc}} output:
> {noformat}
> + mount --make-rslave /run/netns
> + test -f /proc/sys/net/ipv6/conf/all/disable_ipv6
> + echo 1
> + ip link set lo address 02:44:20:bb:42:cf mtu 9001 up
> + ethtool -K eth0 rx off
> (...)
> + tc filter show dev eth0 parent ffff:0
> + tc filter show dev lo parent ffff:0
> I0215 16:01:13.941375 1 exec.cpp:161] Version: 1.0.2
> {noformat}
> Then the executor never comes back to the REGISTERED state and hangs indefinitely.
> {{GLOG_v=3}} doesn't help here.
> My skills in this area are limited, but after loading the symbols and attaching
> gdb to the mesos-executor process, I'm able to print this stack:
> {noformat}
> #0 0x00007feffc1386d5 in pthread_cond_wait@@GLIBC_2.3.2 () from
> /usr/lib64/libpthread.so.0
> #1 0x00007feffbed69ec in
> std::condition_variable::wait(std::unique_lock<std::mutex>&) () from
> /usr/lib64/libstdc++.so.6
> #2 0x00007ff0003dd8ec in void synchronized_wait<std::condition_variable,
> std::mutex>(std::condition_variable*, std::mutex*) () from
> /usr/lib64/libmesos-1.0.2.so
> #3 0x00007ff0017d595d in Gate::arrive(long) () from
> /usr/lib64/libmesos-1.0.2.so
> #4 0x00007ff0017c00ed in process::ProcessManager::wait(process::UPID const&)
> () from /usr/lib64/libmesos-1.0.2.so
> #5 0x00007ff0017c5c05 in process::wait(process::UPID const&, Duration
> const&) () from /usr/lib64/libmesos-1.0.2.so
> #6 0x00000000004ab26f in process::wait(process::ProcessBase const*, Duration
> const&) ()
> #7 0x00000000004a3903 in main ()
> {noformat}
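> For reference, a backtrace like the one above can be captured with something along
> these lines (a sketch; it assumes the mesos and libstdc++ debuginfo packages are
> installed so the symbols resolve):
> {noformat}
> # Attach to the hung executor, dump every thread's stack, then detach.
> gdb -batch -ex 'thread apply all bt' -p "$(pgrep -f mesos-executor | head -n1)"
> {noformat}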
> I concluded that the underlying shell script launched by the isolator, or the
> task itself, is simply blocked, but I don't understand why.
> Here is a process tree showing that no task is running but the executor is:
> {noformat}
> root 28420 0.8 3.0 1061420 124940 ? Ssl 17:56 0:25
> /usr/sbin/mesos-slave --advertise_ip=127.0.0.1
> --attributes=platform:centos;platform_major_version:7;type:base
> --cgroups_enable_cfs --cgroups_hierarchy=/sys/fs/cgroup
> --cgroups_net_cls_primary_handle=0xC370
> --container_logger=org_apache_mesos_LogrotateContainerLogger
> --containerizers=mesos,docker
> --credential=file:///etc/mesos-chef/slave-credential
> --default_container_info={"type":"MESOS","volumes":[{"host_path":"tmp","container_path":"/tmp","mode":"RW"}]}
> --default_role=default --docker_registry=/usr/share/mesos/users
> --docker_store_dir=/var/opt/mesos/store/docker
> --egress_unique_flow_per_container --enforce_container_disk_quota
> --ephemeral_ports_per_container=128
> --executor_environment_variables={"PATH":"/bin:/usr/bin:/usr/sbin","CRITEO_DC":"par","CRITEO_ENV":"prod"}
> --image_providers=docker --image_provisioner_backend=copy
> --isolation=cgroups/cpu,cgroups/mem,cgroups/net_cls,namespaces/pid,disk/du,filesystem/shared,filesystem/linux,docker/runtime,network/cni,network/port_mapping
> --logging_level=INFO
> --master=zk://mesos:[email protected]:2181/mesos
> --modules=file:///etc/mesos-chef/slave-modules.json --port=5051
> --recover=reconnect
> --resources=ports:[31000-32000];ephemeral_ports:[32768-57344] --strict
> --work_dir=/var/opt/mesos
> root 28484 0.0 2.3 433676 95016 ? Ssl 17:56 0:00 \_
> mesos-logrotate-logger --help=false
> --log_filename=/var/opt/mesos/slaves/cdf94219-87b2-4af2-9f61-5697f0442915-S0/frameworks/366e8ed2-730e-4423-9324-086704d182b0-0000/executors/group_simplehttp.16f7c2ee-f3a8-11e6-be1c-0242b44d071f/runs/1d3e6b1c-cda8-47e5-92c4-a161429a7ac6/stdout
> --logrotate_options=rotate 5 --logrotate_path=logrotate --max_size=10MB
> root 28485 0.0 2.3 499212 94724 ? Ssl 17:56 0:00 \_
> mesos-logrotate-logger --help=false
> --log_filename=/var/opt/mesos/slaves/cdf94219-87b2-4af2-9f61-5697f0442915-S0/frameworks/366e8ed2-730e-4423-9324-086704d182b0-0000/executors/group_simplehttp.16f7c2ee-f3a8-11e6-be1c-0242b44d071f/runs/1d3e6b1c-cda8-47e5-92c4-a161429a7ac6/stderr
> --logrotate_options=rotate 5 --logrotate_path=logrotate --max_size=10MB
> marathon 28487 0.0 2.4 635780 97388 ? Ssl 17:56 0:00 \_
> mesos-executor --launcher_dir=/usr/libexec/mesos
> {noformat}
> If someone has a clue about the issue I'm experiencing on EC2, I would be
> interested to talk...
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)