[
https://issues.apache.org/jira/browse/MESOS-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16178496#comment-16178496
]
Bill Green commented on MESOS-7130:
-----------------------------------
I just ran into this exact same problem using the {{network/port_mapping}} isolator
compiled into a DCOS build (Mesos 1.2.2, DCOS 1.9.2). The affected agents are
running on bare-metal CoreOS 1298.7.0.
We use bonding on these hosts, so our interface name is bond0.
I noticed that inside the container's network namespace, the MTU of the bond0
interface showed as 1500, which is different from the host's bond0, where we use an
MTU of 9000.
After I changed the host's bond0 MTU to 1500, the {{network/port_mapping}} isolator
behaved as expected.
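For anyone hitting the same thing, this is roughly how I compared the two sides (a
sketch rather than my exact session; the interface name and the namespace handle
under /run/netns will differ on your hosts):
{noformat}
# Host side: bond0 runs with our jumbo-frame MTU.
ip link show bond0 | grep -o 'mtu [0-9]*'             # -> mtu 9000

# Container side: list the namespace handles the isolator bind-mounts under
# /run/netns and inspect the mirrored interface inside one of them.
ip netns list
ip netns exec <container-netns> ip link show bond0    # -> mtu 1500 here

# Workaround that made the isolator behave for us: make the host match.
ip link set dev bond0 mtu 1500
{noformat}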
It seems the isolator assumes an MTU of 1500 when it mirrors the interface into the
container's namespace, and that mismatch breaks path MTU discovery.
It wouldn't surprise me if the MTU on EC2 hosts were something other than 1500; the
{{mtu 9001}} in the quoted log below is in fact the jumbo-frame MTU EC2 commonly uses.
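If you want to confirm that oversized packets are being black-holed rather than
fragmented, forcing the DF bit makes it obvious (again just a sketch;
{{<destination>}} is a placeholder for any host reachable from the agent):
{noformat}
# 1472 = 1500 minus 28 bytes of IP + ICMP headers; 8972 = 9000 minus 28.
ping -M do -c 3 -s 1472 <destination>   # should get replies
ping -M do -c 3 -s 8972 <destination>   # no replies here points at an MTU/PMTUD problem
tracepath <destination>                 # reports the path MTU it actually discovers
{noformat}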
> port_mapping isolator: executor hangs when running on EC2
> ---------------------------------------------------------
>
> Key: MESOS-7130
> URL: https://issues.apache.org/jira/browse/MESOS-7130
> Project: Mesos
> Issue Type: Bug
> Components: executor
> Reporter: Pierre Cheynier
>
> Hi,
> I'm experiencing a weird issue: I'm using a CI pipeline to test our
> infrastructure automation.
> I recently activated the {{network/port_mapping}} isolator.
> I'm able to make the changes work and pass the tests on bare-metal servers
> and VirtualBox VMs using this configuration.
> But when I try on EC2 (which my CI pipeline relies on), it systematically fails
> to run any container.
> It appears that the sandbox is created and the port_mapping isolator seems to
> be OK according to the logs in stdout and stderr and the {{tc}} output:
> {noformat}
> + mount --make-rslave /run/netns
> + test -f /proc/sys/net/ipv6/conf/all/disable_ipv6
> + echo 1
> + ip link set lo address 02:44:20:bb:42:cf mtu 9001 up
> + ethtool -K eth0 rx off
> (...)
> + tc filter show dev eth0 parent ffff:0
> + tc filter show dev lo parent ffff:0
> I0215 16:01:13.941375 1 exec.cpp:161] Version: 1.0.2
> {noformat}
> Then the executor never comes back to the REGISTERED state and hangs indefinitely.
> {{GLOG_v=3}} doesn't help here.
> My skills in this area are limited, but after loading the symbols and attaching
> gdb to the mesos-executor process, I'm able to print this stack:
> {noformat}
> #0 0x00007feffc1386d5 in pthread_cond_wait@@GLIBC_2.3.2 () from
> /usr/lib64/libpthread.so.0
> #1 0x00007feffbed69ec in
> std::condition_variable::wait(std::unique_lock<std::mutex>&) () from
> /usr/lib64/libstdc++.so.6
> #2 0x00007ff0003dd8ec in void synchronized_wait<std::condition_variable,
> std::mutex>(std::condition_variable*, std::mutex*) () from
> /usr/lib64/libmesos-1.0.2.so
> #3 0x00007ff0017d595d in Gate::arrive(long) () from
> /usr/lib64/libmesos-1.0.2.so
> #4 0x00007ff0017c00ed in process::ProcessManager::wait(process::UPID const&)
> () from /usr/lib64/libmesos-1.0.2.so
> #5 0x00007ff0017c5c05 in process::wait(process::UPID const&, Duration
> const&) () from /usr/lib64/libmesos-1.0.2.so
> #6 0x00000000004ab26f in process::wait(process::ProcessBase const*, Duration
> const&) ()
> #7 0x00000000004a3903 in main ()
> {noformat}
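> For reference, a backtrace like the one above can be captured with something along
> these lines (a sketch; it assumes the mesos and libstdc++ debuginfo packages are
> installed so the symbols resolve):
> {noformat}
> # Attach to the hung executor, dump every thread's stack, then detach.
> gdb -batch -ex 'thread apply all bt' -p "$(pgrep -f mesos-executor | head -n1)"
> {noformat}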
> I concluded that the underlying shell script launched by the isolator, or the
> task itself, is simply blocked, but I don't understand why.
> Here is a process tree showing that no task is running but the executor is:
> {noformat}
> root 28420 0.8 3.0 1061420 124940 ? Ssl 17:56 0:25
> /usr/sbin/mesos-slave --advertise_ip=127.0.0.1
> --attributes=platform:centos;platform_major_version:7;type:base
> --cgroups_enable_cfs --cgroups_hierarchy=/sys/fs/cgroup
> --cgroups_net_cls_primary_handle=0xC370
> --container_logger=org_apache_mesos_LogrotateContainerLogger
> --containerizers=mesos,docker
> --credential=file:///etc/mesos-chef/slave-credential
> --default_container_info={"type":"MESOS","volumes":[{"host_path":"tmp","container_path":"/tmp","mode":"RW"}]}
> --default_role=default --docker_registry=/usr/share/mesos/users
> --docker_store_dir=/var/opt/mesos/store/docker
> --egress_unique_flow_per_container --enforce_container_disk_quota
> --ephemeral_ports_per_container=128
> --executor_environment_variables={"PATH":"/bin:/usr/bin:/usr/sbin","CRITEO_DC":"par","CRITEO_ENV":"prod"}
> --image_providers=docker --image_provisioner_backend=copy
> --isolation=cgroups/cpu,cgroups/mem,cgroups/net_cls,namespaces/pid,disk/du,filesystem/shared,filesystem/linux,docker/runtime,network/cni,network/port_mapping
> --logging_level=INFO
> --master=zk://mesos:[email protected]:2181/mesos
> --modules=file:///etc/mesos-chef/slave-modules.json --port=5051
> --recover=reconnect
> --resources=ports:[31000-32000];ephemeral_ports:[32768-57344] --strict
> --work_dir=/var/opt/mesos
> root 28484 0.0 2.3 433676 95016 ? Ssl 17:56 0:00 \_
> mesos-logrotate-logger --help=false
> --log_filename=/var/opt/mesos/slaves/cdf94219-87b2-4af2-9f61-5697f0442915-S0/frameworks/366e8ed2-730e-4423-9324-086704d182b0-0000/executors/group_simplehttp.16f7c2ee-f3a8-11e6-be1c-0242b44d071f/runs/1d3e6b1c-cda8-47e5-92c4-a161429a7ac6/stdout
> --logrotate_options=rotate 5 --logrotate_path=logrotate --max_size=10MB
> root 28485 0.0 2.3 499212 94724 ? Ssl 17:56 0:00 \_
> mesos-logrotate-logger --help=false
> --log_filename=/var/opt/mesos/slaves/cdf94219-87b2-4af2-9f61-5697f0442915-S0/frameworks/366e8ed2-730e-4423-9324-086704d182b0-0000/executors/group_simplehttp.16f7c2ee-f3a8-11e6-be1c-0242b44d071f/runs/1d3e6b1c-cda8-47e5-92c4-a161429a7ac6/stderr
> --logrotate_options=rotate 5 --logrotate_path=logrotate --max_size=10MB
> marathon 28487 0.0 2.4 635780 97388 ? Ssl 17:56 0:00 \_
> mesos-executor --launcher_dir=/usr/libexec/mesos
> {noformat}
> If someone has a clue about the issue I'm experiencing on EC2, I would be
> interested to talk...
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)