[ https://issues.apache.org/jira/browse/MESOS-6327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15560608#comment-15560608 ]

Rogier Dikkes commented on MESOS-6327:
--------------------------------------

Hi Gilbert, 

Thank you for taking the time to look at this issue!

Overlay and AuFS are not supported/enabled storage backends by default in the 
RHEL kernel (CentOS 7.2 1511, 3.10.0-327.36.1.el7.x86_64). When I activated the 
kernel module for overlay to test it, I got this error: 

E1009 22:43:50.820899 26069 slave.cpp:3976] Container 
'b9ccf687-6d6f-4896-9cf8-63420c225ead' for executor 
'thermos-testuser1-test-edsdemo-0-aa8aa2c4-94ef-4e25-9d15-0394fdf02f40' of 
framework ab28b3ed-85d1-4bce-898e-e57a5f332762-0000 failed to start: Collect 
failed: Failed to mount rootfs 
'/var/lib/mesos/provisioner/containers/b9ccf687-6d6f-4896-9cf8-63420c225ead/backends/overlay/rootfses/1a5d8749-8701-4113-b0c5-ded1a6fc57d4'
 with overlayfs: No such file or directory

I changed the image_provisioner_backend to overlay and checked that lsmod 
listed the overlay module as loaded. 
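For reference, this is roughly how I verify the module state (a minimal sketch; 
on the RHEL/CentOS 7 kernel the module is named "overlay", not "overlayfs", and 
modprobe needs root):

```shell
# Check whether the overlay kernel module is loaded; try to load it if not.
if lsmod | grep -q '^overlay'; then
  echo "overlay module loaded"
else
  modprobe overlay && echo "overlay module loaded now"  # requires root
fi
```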

The sandbox is empty; there are no logs at all.

In dmesg I found: 
[Sun Oct  9 22:27:53 2016] overlayfs: failed to resolve 
'/tmp/mesos/store/docker/layers/e919e9426cdc21e630aea7524701bf596c04f9': -2

The mentioned directory does not exist.
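A quick way to check where that layer actually lives is to look in both the 
default docker store (/tmp/mesos/store/docker, which the dmesg path points at) 
and the directory the layers actually end up in. LAYER_ID below is a 
hypothetical placeholder; the hash in the dmesg line above is truncated, so the 
full value has to be substituted:

```shell
# LAYER_ID is a placeholder; substitute the full layer hash from dmesg.
LAYER_ID="e919e9426cdc..."
for d in /tmp/mesos/store/docker/layers /var/lib/mesos/images/docker/layers; do
  if [ -d "$d/$LAYER_ID" ]; then
    echo "present: $d/$LAYER_ID"
  else
    echo "missing: $d/$LAYER_ID"
  fi
done
```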

I tried various images with 50+ layers. All fail the same way, each on a 
different layer.

The layers get created and there is data in the folder:
ls -lah /var/lib/mesos/images/docker/layers/ |wc -l
68
du -sh /var/lib/mesos/images/docker/
5.5G    /var/lib/mesos/images/docker/

What I tried:
- Setting sandbox_directory (to /mnt/mesos/sandbox, same as the default); no 
change.
- Setting docker_store_dir; no change.
- Configuring Docker itself with overlayfs (Backing fs: xfs) and starting a 
container; this works, with no messages and no errors.
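As an additional sanity check outside of both Mesos and Docker, a manual 
overlayfs mount shows whether the kernel itself accepts the mount at all (a 
sketch only; run as root, and the /tmp/ovl paths are just examples):

```shell
# Manual overlayfs smoke test, independent of Mesos and Docker (run as root).
mkdir -p /tmp/ovl/lower /tmp/ovl/upper /tmp/ovl/work /tmp/ovl/merged
mount -t overlay overlay \
  -o lowerdir=/tmp/ovl/lower,upperdir=/tmp/ovl/upper,workdir=/tmp/ovl/work \
  /tmp/ovl/merged
mount | grep /tmp/ovl/merged   # should list the overlay mount if it succeeded
umount /tmp/ovl/merged
```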
 
While searching for this error I came across MESOS-6001, which is still open. 
Since you are advising AuFS as a solution in a situation where I am running 
into a large-layer/large-image issue, can that issue be closed, or should I 
ignore AuFS as a suggestion?

Tomorrow I will try to replicate this in a Vagrant box to check whether the 
issue also occurs with the Apache Aurora Ubuntu Vagrant image. 

> Large docker images make the mesos containerizer crash with: Too many levels 
> of symbolic links
> ----------------------------------------------------------------------------------------------
>
>                 Key: MESOS-6327
>                 URL: https://issues.apache.org/jira/browse/MESOS-6327
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, docker
>    Affects Versions: 1.0.0, 1.0.1
>         Environment: centos 7.2 (1511), ubuntu 14.04 (trusty). Replicated in 
> the Apache Aurora vagrant image
>            Reporter: Rogier Dikkes
>            Priority: Critical
>
> When deploying Mesos containers with large (6G+, 60+ layers) Docker images 
> the task crashes with the error: 
> Mesos agent logs: 
> E1007 08:40:12.954227  8117 slave.cpp:3976] Container 
> 'a1d759ae-5bc6-4c4e-ac03-717fbb8e5da4' for executor 
> 'thermos-www-data-devel-hello_docker_image-0-d42d2af6-6b44-4b2b-be95-e1ba93a6b365' 
> of framework dfc91a86-84b9-4539-a7be-4ace7b7b44a1-0000 failed to start: 
> Collect failed: Collect failed: Failed to copy layer: cp: cannot stat 
> ‘/var/lib/mesos/provisioner/containers/a1d759ae-5bc6-4c4e-ac03-717fbb8e5da4/backends/copy/rootfses/5f328f72-25d4-4a26-ac83-8d30bbc44e97/usr/share/zoneinfo/right/Asia/Urumqi’: 
> Too many levels of symbolic links
> ... (complete pastebin: http://pastebin.com/umZ4Q5d1 )
> How to replicate:
> Start the Aurora vagrant image. Set 
> /etc/mesos-slave/executor_registration_timeout to 5 mins. Adjust the file 
> /vagrant/examples/jobs/hello_docker_image.aurora to start a large Docker 
> image instead of the example (you can use anldisr/jupyter:0.4, which I 
> created as a test image; it is based on the Jupyter notebook stacks). Create 
> the job and watch it fail after a few minutes. 
> The mesos sandbox is empty. 
> Aurora errors i see: 
> 28 minutes ago - FAILED : Failed to launch container: Collect failed: Collect 
> failed: Failed to copy layer: cp: cannot stat 
> ‘/var/lib/mesos/provisioner/containers/93420a36-0e0c-4f04-b401-74c426c25686/backends/copy/rootfses/6e185a51-7174-4b0d-a305-42b634eb91bb/usr/share/zoneinfo/right/Asia/Urumqi’:
>  Too many levels of symbolic links cp: cannot stat ... 
> Too many levels of symbolic links ; Container destroyed while provisioning 
> images
> (complete pastebin: http://pastebin.com/uecHYD5J )
> To rule out the image i started this and more images as a normal Docker 
> container. This works without issues. 
> Mesos flags configured (related to this issue): 
> -appc_store_dir /tmp/mesos/images/appc
> -containerizers docker,mesos
> -executor_registration_timeout 5mins
> -image_providers appc,docker
> -image_provisioner_backend copy
> -isolation filesystem/linux,docker/runtime
> Affected Mesos versions tested: 1.0.1 & 1.0.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
