[ 
https://issues.apache.org/jira/browse/MESOS-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15932515#comment-15932515
 ] 

Dominic Gregoire edited comment on MESOS-7130 at 3/20/17 5:42 PM:
------------------------------------------------------------------

I might have run into the same issue, using {{mesos 1.1.0}} with {{libnl 3.2.29}}, 
on an instance with an {{ena}} interface running kernel {{4.4.51-40.58.amzn1}}.

The agent is running with these flags:
{noformat}
export MESOS_isolation=cgroups/cpu,cgroups/mem,network/port_mapping
export MESOS_containerizers=mesos
export MESOS_resources="ports:[31000-32000];ephemeral_ports:[32768-57344]"
export MESOS_ephemeral_ports_per_container=1024
{noformat}
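
To double-check that the agent actually picked up these flags, its {{/flags}} endpoint 
can be queried (a minimal sketch; it assumes the agent listens on its default port 5051, 
which matches the {{:5051}} connections in the netstat output below):
{noformat}
# Dump the agent's effective flags and keep only the ones relevant here.
# jq is used purely for readability; any JSON pretty-printer works.
curl -s http://localhost:5051/flags \
  | jq '.flags | {isolation, containerizers, resources, ephemeral_ports_per_container}'
{noformat}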

Running Spark 2.1.0 with two Mesos containers on the same host, the containers can 
connect to each other's block manager but cannot exchange traffic; the data just sits 
in their send queues.
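
To rule Spark itself out, the stall can be reproduced by hand between the two container 
namespaces (a hypothetical sketch; the namespace names, the host IP and the block-manager 
port 34294 are taken from the {{ip netns}} / {{netstat}} output further down):
{noformat}
# Push a few KB from the first container's namespace to the port the second
# container's block manager is already listening on. On a healthy host this
# returns immediately; here it hangs.
ip netns exec 4600 sh -c 'dd if=/dev/zero bs=1k count=64 | nc 10.32.20.34 34294'
# Meanwhile, from another shell, the bytes can be seen stuck in Send-Q:
ip netns exec 4600 ss -tn
{noformat}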

Spark is logging:
{noformat}
17/03/19 16:54:56 INFO TransportClientFactory: Successfully created connection to ip-10-32-20-34.ec2.internal/10.32.20.34:34294 after 12 ms (0 ms spent in bootstraps)
17/03/19 16:56:56 ERROR TransportChannelHandler: Connection to ip-10-32-20-34.ec2.internal/10.32.20.34:34294 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
{noformat}

I can see connections established between the containers, but everything stays in 
the send queues:
{noformat}
[root@ip-10-32-20-34 sysctl.d]# ip netns
4602 (id: 1)
4600 (id: 0)
[root@ip-10-32-20-34 sysctl.d]# ip netns exec 4600 netstat -an
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address               Foreign Address             State
tcp        0      0 10.32.20.34:32861           0.0.0.0:*                   LISTEN
tcp        0      0 0.0.0.0:33003               0.0.0.0:*                   LISTEN
tcp        0      0 10.32.20.34:33003           10.32.20.34:57363           ESTABLISHED
tcp        0      0 10.32.20.34:33566           10.32.20.34:34294           ESTABLISHED
tcp        0      0 10.32.20.34:33658           10.32.18.185:40600          ESTABLISHED
tcp        0      0 10.32.20.34:32832           10.32.18.185:40196          ESTABLISHED
tcp        0      0 10.32.20.34:33406           10.32.20.34:5051            ESTABLISHED
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node Path
unix  2      [ ]         STREAM     CONNECTED     21869
unix  2      [ ]         STREAM     CONNECTED     20339
[root@ip-10-32-20-34 sysctl.d]# ip netns exec 4602 netstat -an
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address               Foreign Address             State
tcp        0      0 0.0.0.0:33836               0.0.0.0:*                   LISTEN
tcp        0      0 10.32.20.34:34294           0.0.0.0:*                   LISTEN
tcp        0  24229 10.32.20.34:34294           10.32.20.34:33566           ESTABLISHED
tcp        0      0 10.32.20.34:33860           10.32.18.185:40196          ESTABLISHED
tcp        0      0 10.32.20.34:34680           10.32.18.185:40600          ESTABLISHED
tcp        0      0 10.32.20.34:34434           10.32.20.34:5051            ESTABLISHED
tcp        0      0 10.32.20.34:33836           10.32.20.34:58149           ESTABLISHED
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node Path
unix  2      [ ]         STREAM     CONNECTED     20359
unix  2      [ ]         STREAM     CONNECTED     20373
{noformat}
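
As a follow-up check, the counters on both sides of the data path can be inspected 
(a hedged sketch; the interface name inside the namespaces is assumed to be {{eth0}}, 
adjust it to whatever {{ip -o link}} reports there):
{noformat}
# Per-namespace TCP details (Send-Q, retransmits) and traffic-control counters.
for ns in 4600 4602; do
  echo "== netns $ns =="
  ip netns exec "$ns" ss -tni                                # socket details incl. retransmits
  ip netns exec "$ns" tc -s qdisc show                       # drops / overlimits on the container qdiscs
  ip netns exec "$ns" tc filter show dev eth0 parent ffff:0  # ingress filters, as in the report below
done
# Offload state of the ena NIC on the host side.
ethtool -k eth0 | grep -E 'scatter-gather|segmentation|offload'
{noformat}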



> port_mapping isolator: executor hangs when running on EC2
> ---------------------------------------------------------
>
>                 Key: MESOS-7130
>                 URL: https://issues.apache.org/jira/browse/MESOS-7130
>             Project: Mesos
>          Issue Type: Bug
>          Components: ec2, executor
>            Reporter: Pierre Cheynier
>
> Hi,
> I'm experiencing a weird issue: I'm using a CI to do testing on 
> infrastructure automation.
> I recently activated the {{network/port_mapping}} isolator.
> I'm able to make the changes work and pass the tests for bare-metal servers 
> and VirtualBox VMs using this configuration.
> But when I try on EC2 (on which my CI pipeline relies), it systematically fails 
> to run any container.
> It appears that the sandbox is created and the port_mapping isolator seems to 
> be OK according to the logs in stdout and stderr and the {{tc}} output:
> {noformat}
> + mount --make-rslave /run/netns
> + test -f /proc/sys/net/ipv6/conf/all/disable_ipv6
> + echo 1
> + ip link set lo address 02:44:20:bb:42:cf mtu 9001 up
> + ethtool -K eth0 rx off
> (...)
> + tc filter show dev eth0 parent ffff:0
> + tc filter show dev lo parent ffff:0
> I0215 16:01:13.941375     1 exec.cpp:161] Version: 1.0.2
> {noformat}
> Then the executor never comes back to the REGISTERED state and hangs indefinitely.
> {{GLOG_v=3}} doesn't help here.
> My skills in this area are limited, but after loading the symbols and attaching 
> gdb to the mesos-executor process, I'm able to print this stack:
> {noformat}
> #0  0x00007feffc1386d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
> #1  0x00007feffbed69ec in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib64/libstdc++.so.6
> #2  0x00007ff0003dd8ec in void synchronized_wait<std::condition_variable, std::mutex>(std::condition_variable*, std::mutex*) () from /usr/lib64/libmesos-1.0.2.so
> #3  0x00007ff0017d595d in Gate::arrive(long) () from /usr/lib64/libmesos-1.0.2.so
> #4  0x00007ff0017c00ed in process::ProcessManager::wait(process::UPID const&) () from /usr/lib64/libmesos-1.0.2.so
> #5  0x00007ff0017c5c05 in process::wait(process::UPID const&, Duration const&) () from /usr/lib64/libmesos-1.0.2.so
> #6  0x00000000004ab26f in process::wait(process::ProcessBase const*, Duration const&) ()
> #7  0x00000000004a3903 in main ()
> {noformat}
> I concluded that the underlying shell script launched by the isolator, or the 
> task itself, is just... blocked. But I don't understand why.
> Here is a process tree showing that no task is running but the executor is:
> {noformat}
> root     28420  0.8  3.0 1061420 124940 ?      Ssl  17:56   0:25 
> /usr/sbin/mesos-slave --advertise_ip=127.0.0.1 
> --attributes=platform:centos;platform_major_version:7;type:base 
> --cgroups_enable_cfs --cgroups_hierarchy=/sys/fs/cgroup 
> --cgroups_net_cls_primary_handle=0xC370 
> --container_logger=org_apache_mesos_LogrotateContainerLogger 
> --containerizers=mesos,docker 
> --credential=file:///etc/mesos-chef/slave-credential 
> --default_container_info={"type":"MESOS","volumes":[{"host_path":"tmp","container_path":"/tmp","mode":"RW"}]}
>  --default_role=default --docker_registry=/usr/share/mesos/users 
> --docker_store_dir=/var/opt/mesos/store/docker 
> --egress_unique_flow_per_container --enforce_container_disk_quota 
> --ephemeral_ports_per_container=128 
> --executor_environment_variables={"PATH":"/bin:/usr/bin:/usr/sbin","CRITEO_DC":"par","CRITEO_ENV":"prod"}
>  --image_providers=docker --image_provisioner_backend=copy 
> --isolation=cgroups/cpu,cgroups/mem,cgroups/net_cls,namespaces/pid,disk/du,filesystem/shared,filesystem/linux,docker/runtime,network/cni,network/port_mapping
>  --logging_level=INFO 
> --master=zk://mesos:[email protected]:2181/mesos 
> --modules=file:///etc/mesos-chef/slave-modules.json --port=5051 
> --recover=reconnect 
> --resources=ports:[31000-32000];ephemeral_ports:[32768-57344] --strict 
> --work_dir=/var/opt/mesos
> root     28484  0.0  2.3 433676 95016 ?        Ssl  17:56   0:00  \_ 
> mesos-logrotate-logger --help=false 
> --log_filename=/var/opt/mesos/slaves/cdf94219-87b2-4af2-9f61-5697f0442915-S0/frameworks/366e8ed2-730e-4423-9324-086704d182b0-0000/executors/group_simplehttp.16f7c2ee-f3a8-11e6-be1c-0242b44d071f/runs/1d3e6b1c-cda8-47e5-92c4-a161429a7ac6/stdout
>  --logrotate_options=rotate 5 --logrotate_path=logrotate --max_size=10MB
> root     28485  0.0  2.3 499212 94724 ?        Ssl  17:56   0:00  \_ 
> mesos-logrotate-logger --help=false 
> --log_filename=/var/opt/mesos/slaves/cdf94219-87b2-4af2-9f61-5697f0442915-S0/frameworks/366e8ed2-730e-4423-9324-086704d182b0-0000/executors/group_simplehttp.16f7c2ee-f3a8-11e6-be1c-0242b44d071f/runs/1d3e6b1c-cda8-47e5-92c4-a161429a7ac6/stderr
>  --logrotate_options=rotate 5 --logrotate_path=logrotate --max_size=10MB
> marathon 28487  0.0  2.4 635780 97388 ?        Ssl  17:56   0:00  \_ 
> mesos-executor --launcher_dir=/usr/libexec/mesos
> {noformat}
> If someone has a clue about the issue I'm experiencing on EC2, I would be 
> interested to talk...


