[ https://issues.apache.org/jira/browse/MESOS-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15869834#comment-15869834 ]

Pierre Cheynier commented on MESOS-7130:
----------------------------------------

[~avinash.mesos] Here is my setup:
* CentOS 7.2.1511 
* LTS kernel (4.4.21 at the time, because we use an internal version-freeze mechanism)
* libnl 3.2.28 (we moved to the one published in January in the CentOS repos: 
https://www.rpmfind.net/linux/RPM/centos/updates/7.3.1611/x86_64/Packages/libnl3-3.2.28-3.el7_3.x86_64.html).

Every environment (physical, vbox, EC2) uses the same setup and even points to 
the same internal RPM mirrors, so every package should be identical (a quick 
cross-check is sketched below). We use Packer to build our images, and most of 
the steps are shared between the vbox and EC2 AMIs (kernel upgrade, internal 
mirror, etc.).
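To rule out package drift, something along these lines could confirm the parity 
claim (a minimal sketch; the hostnames are placeholders):
{noformat}
# Compare kernel and libnl versions across the three environments.
for h in baremetal-01 vbox-01 ec2-01; do
  ssh "$h" 'echo "== $(hostname)"; uname -r; rpm -q libnl3'
done
{noformat}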

The Mesos stack itself is installed via Chef and test-kitchen.

The executor is the default {{mesos-executor}} (command-executor) in any case.

We are currently wondering about side effects of doing things like {{ethtool 
-K eth0 rx off}}, setting the same MAC in the netns, etc., on the {{vif}} 
driver (the Xen virtual interface used for network paravirtualization).
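A rough way to compare the offload and MAC state inside and outside the task 
netns (a sketch; PID 9039 is the task process used below):
{noformat}
# Offload settings and link state in the root netns...
ethtool -k eth0 | grep -E 'checksumming|segmentation-offload'
ip -o link show eth0
# ...and the same inside the task netns.
sudo nsenter -t 9039 -n sh -c \
  "ethtool -k eth0 | grep -E 'checksumming|segmentation-offload'; ip -o link show eth0"
{noformat}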

For instance, some communications seem to be partially blocked...
{noformat}
# eth0 interface in the root netns
[centos@ip-10-0-143-253 ~]$ ip a s eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc fq_codel state UP 
qlen 1000
    link/ether 02:64:28:aa:e4:0d brd ff:ff:ff:ff:ff:ff
    inet 10.0.143.253/16 brd 10.0.255.255 scope global dynamic eth0
       valid_lft 3436sec preferred_lft 3436sec
    inet6 fe80::64:28ff:feaa:e40d/64 scope link 
       valid_lft forever preferred_lft forever
# Enter the netns of the task
[centos@ip-10-0-143-253 ~]$ sudo nsenter -t 9039 -n 
# Curl a simple endpoint that will return a short answer
[root@ip-10-0-143-253 centos]# curl -vv http://10.0.143.253:5051/ -m 5
* About to connect() to 10.0.143.253 port 5051 (#0)
*   Trying 10.0.143.253...
* Connected to 10.0.143.253 (10.0.143.253) port 5051 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 10.0.143.253:5051
> Accept: */*
> 
< HTTP/1.1 404 Not Found
< Date: Thu, 16 Feb 2017 12:28:08 GMT
< Content-Length: 0
< 
* Connection #0 to host 10.0.143.253 left intact
# Now curl something bigger
[root@ip-10-0-143-253 centos]# curl -vv 
http://10.0.143.253:5051/monitor/statistics.json -m 5
* About to connect() to 10.0.143.253 port 5051 (#0)
*   Trying 10.0.143.253...
* Connected to 10.0.143.253 (10.0.143.253) port 5051 (#0)
> GET /monitor/statistics.json HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 10.0.143.253:5051
> Accept: */*
> 
* Operation timed out after 5001 milliseconds with 0 out of -1 bytes received
* Closing connection 0
curl: (28) Operation timed out after 5001 milliseconds with 0 out of -1 bytes 
received
[root@ip-10-0-143-253 centos]# ip  a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 9001 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 02:64:28:aa:e4:0d brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0@if16: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP 
qlen 1000
    link/ether 02:64:28:aa:e4:0d brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.0.143.253/16 scope global eth0
       valid_lft forever preferred_lft forever
{noformat}

Running tcpdump in the netns shows the 3-way handshake, the request payload, the 
corresponding ACK, and then... nothing.
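One detail visible in the output above is the MTU mismatch: 9001 on the root 
netns eth0 versus 1500 inside the task netns, which would be consistent with 
short responses passing while larger ones stall. Capturing on both sides at 
once might show where the data dies (a sketch; 5051 is the agent port used 
above):
{noformat}
# Host side, in the root netns:
sudo tcpdump -ni eth0 'tcp port 5051'
# In another shell, inside the task netns:
sudo nsenter -t 9039 -n tcpdump -ni eth0 'tcp port 5051'
{noformat}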

> port_mapping isolator: executor hangs when running on EC2
> ---------------------------------------------------------
>
>                 Key: MESOS-7130
>                 URL: https://issues.apache.org/jira/browse/MESOS-7130
>             Project: Mesos
>          Issue Type: Bug
>          Components: ec2, executor
>            Reporter: Pierre Cheynier
>
> Hi,
> I'm experiencing a weird issue: I'm using CI to test our infrastructure 
> automation.
> I recently activated the {{network/port_mapping}} isolator.
> I'm able to make the changes work and pass the tests for bare-metal servers 
> and VirtualBox VMs using this configuration.
> But when I try on EC2 (on which my CI pipeline relies), it systematically 
> fails to run any container.
> It appears that the sandbox is created and the port_mapping isolator seems to 
> be OK according to the logs in stdout and stderr and the {{tc}} output:
> {noformat}
> + mount --make-rslave /run/netns
> + test -f /proc/sys/net/ipv6/conf/all/disable_ipv6
> + echo 1
> + ip link set lo address 02:44:20:bb:42:cf mtu 9001 up
> + ethtool -K eth0 rx off
> (...)
> + tc filter show dev eth0 parent ffff:0
> + tc filter show dev lo parent ffff:0
> I0215 16:01:13.941375     1 exec.cpp:161] Version: 1.0.2
> {noformat}
> Then the executor never comes back to the REGISTERED state and hangs 
> indefinitely.
> {{GLOG_v=3}} doesn't help here.
> My skills in this area are limited, but after loading the symbols and 
> attaching gdb to the mesos-executor process, I'm able to print this stack:
> {noformat}
> #0  0x00007feffc1386d5 in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /usr/lib64/libpthread.so.0
> #1  0x00007feffbed69ec in 
> std::condition_variable::wait(std::unique_lock<std::mutex>&) () from 
> /usr/lib64/libstdc++.so.6
> #2  0x00007ff0003dd8ec in void synchronized_wait<std::condition_variable, 
> std::mutex>(std::condition_variable*, std::mutex*) () from 
> /usr/lib64/libmesos-1.0.2.so
> #3  0x00007ff0017d595d in Gate::arrive(long) () from 
> /usr/lib64/libmesos-1.0.2.so
> #4  0x00007ff0017c00ed in process::ProcessManager::wait(process::UPID const&) 
> () from /usr/lib64/libmesos-1.0.2.so
> #5  0x00007ff0017c5c05 in process::wait(process::UPID const&, Duration 
> const&) () from /usr/lib64/libmesos-1.0.2.so
> #6  0x00000000004ab26f in process::wait(process::ProcessBase const*, Duration 
> const&) ()
> #7  0x00000000004a3903 in main ()
> {noformat}
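> For reference, a stack like this can be captured with something along these 
> lines (a sketch; the debuginfo package name may vary by distribution):
> {noformat}
> # Install debug symbols, attach to the running executor, dump all threads.
> sudo debuginfo-install mesos   # hypothetical package name
> sudo gdb -p "$(pidof mesos-executor)" -batch -ex 'thread apply all bt'
> {noformat}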
> I concluded that the underlying shell script launched by the isolator, or the 
> task itself, is simply blocked. But I don't understand why.
> Here is a process tree showing that no task is running, but the executor is:
> {noformat}
> root     28420  0.8  3.0 1061420 124940 ?      Ssl  17:56   0:25 
> /usr/sbin/mesos-slave --advertise_ip=127.0.0.1 
> --attributes=platform:centos;platform_major_version:7;type:base 
> --cgroups_enable_cfs --cgroups_hierarchy=/sys/fs/cgroup 
> --cgroups_net_cls_primary_handle=0xC370 
> --container_logger=org_apache_mesos_LogrotateContainerLogger 
> --containerizers=mesos,docker 
> --credential=file:///etc/mesos-chef/slave-credential 
> --default_container_info={"type":"MESOS","volumes":[{"host_path":"tmp","container_path":"/tmp","mode":"RW"}]}
>  --default_role=default --docker_registry=/usr/share/mesos/users 
> --docker_store_dir=/var/opt/mesos/store/docker 
> --egress_unique_flow_per_container --enforce_container_disk_quota 
> --ephemeral_ports_per_container=128 
> --executor_environment_variables={"PATH":"/bin:/usr/bin:/usr/sbin","CRITEO_DC":"par","CRITEO_ENV":"prod"}
>  --image_providers=docker --image_provisioner_backend=copy 
> --isolation=cgroups/cpu,cgroups/mem,cgroups/net_cls,namespaces/pid,disk/du,filesystem/shared,filesystem/linux,docker/runtime,network/cni,network/port_mapping
>  --logging_level=INFO 
> --master=zk://mesos:test@localhost.localdomain:2181/mesos 
> --modules=file:///etc/mesos-chef/slave-modules.json --port=5051 
> --recover=reconnect 
> --resources=ports:[31000-32000];ephemeral_ports:[32768-57344] --strict 
> --work_dir=/var/opt/mesos
> root     28484  0.0  2.3 433676 95016 ?        Ssl  17:56   0:00  \_ 
> mesos-logrotate-logger --help=false 
> --log_filename=/var/opt/mesos/slaves/cdf94219-87b2-4af2-9f61-5697f0442915-S0/frameworks/366e8ed2-730e-4423-9324-086704d182b0-0000/executors/group_simplehttp.16f7c2ee-f3a8-11e6-be1c-0242b44d071f/runs/1d3e6b1c-cda8-47e5-92c4-a161429a7ac6/stdout
>  --logrotate_options=rotate 5 --logrotate_path=logrotate --max_size=10MB
> root     28485  0.0  2.3 499212 94724 ?        Ssl  17:56   0:00  \_ 
> mesos-logrotate-logger --help=false 
> --log_filename=/var/opt/mesos/slaves/cdf94219-87b2-4af2-9f61-5697f0442915-S0/frameworks/366e8ed2-730e-4423-9324-086704d182b0-0000/executors/group_simplehttp.16f7c2ee-f3a8-11e6-be1c-0242b44d071f/runs/1d3e6b1c-cda8-47e5-92c4-a161429a7ac6/stderr
>  --logrotate_options=rotate 5 --logrotate_path=logrotate --max_size=10MB
> marathon 28487  0.0  2.4 635780 97388 ?        Ssl  17:56   0:00  \_ 
> mesos-executor --launcher_dir=/usr/libexec/mesos
> {noformat}
> If someone has a clue about the issue I'm experiencing on EC2, I would be 
> interested to talk...


