[
https://issues.apache.org/jira/browse/MESOS-6327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15560608#comment-15560608
]
Rogier Dikkes commented on MESOS-6327:
--------------------------------------
Hi Gilbert,
Thank you for taking the time to look at this issue!
The overlay and aufs storage backends are not supported or enabled by default
in the RHEL kernel (CentOS 7.2 1511, 3.10.0-327.36.1.el7.x86_64). When I
activated the kernel module for overlay to test it, I got this error:
E1009 22:43:50.820899 26069 slave.cpp:3976] Container
'b9ccf687-6d6f-4896-9cf8-63420c225ead' for executor
'thermos-testuser1-test-edsdemo-0-aa8aa2c4-94ef-4e25-9d15-0394fdf02f40' of
framework ab28b3ed-85d1-4bce-898e-e57a5f332762-0000 failed to start: Collect
failed: Failed to mount rootfs
'/var/lib/mesos/provisioner/containers/b9ccf687-6d6f-4896-9cf8-63420c225ead/backends/overlay/rootfses/1a5d8749-8701-4113-b0c5-ded1a6fc57d4'
with overlayfs: No such file or directory
I had changed the image_provisioner_backend to overlay and checked that lsmod
listed the overlay module as loaded.
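As a rough sketch of that check: lsmod formats the contents of /proc/modules, and the kernel lists registered filesystem types in /proc/filesystems. The helper names below are hypothetical, and accepting both "overlay" and "overlayfs" as type names is an assumption, since the registered name differs between kernel versions.

```python
def module_loaded(modules_text: str, name: str = "overlay") -> bool:
    """Return True if `name` is the first field of any line of /proc/modules content."""
    for line in modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


def overlay_registered(filesystems_text: str) -> bool:
    """Return True if an overlay filesystem type appears in /proc/filesystems content."""
    for line in filesystems_text.splitlines():
        fields = line.split()
        # Lines look like "nodev<TAB>overlay" or "<TAB>xfs"; the type is the last field.
        if fields and fields[-1] in ("overlay", "overlayfs"):
            return True
    return False


def check_host() -> dict:
    """Read the real proc files (Linux only) and report both checks."""
    with open("/proc/modules") as modules, open("/proc/filesystems") as filesystems:
        return {
            "module_loaded": module_loaded(modules.read()),
            "fs_registered": overlay_registered(filesystems.read()),
        }
```

Calling check_host() on the agent host returns both results; the module can be loaded without the filesystem type being usable, so checking both may narrow things down.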
The sandbox is empty; there are no logs or anything else.
In dmesg I found:
[Sun Oct 9 22:27:53 2016] overlayfs: failed to resolve
'/tmp/mesos/store/docker/layers/e919e9426cdc21e630aea7524701bf596c04f9': -2
The directory mentioned there does not exist.
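For reference, the trailing -2 in the dmesg line is a negated errno value; errno 2 is ENOENT, which matches the "No such file or directory" text in the agent log. A minimal sketch for decoding such codes and for checking which directories from a list are missing (the function names are hypothetical and not part of Mesos):

```python
import errno
import os


def describe_kernel_error(code: int) -> str:
    """Translate a (possibly negated) errno value into its symbolic name and message."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{name}: {os.strerror(number)}"


def missing_dirs(paths):
    """Return the subset of `paths` that are not existing directories."""
    return [path for path in paths if not os.path.isdir(path)]
```

Passing the layer paths reported by the kernel to missing_dirs() would show whether the overlay mount is being handed directories that were never created under that prefix.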
I tried various images with more than 50 layers. All have the same issue, each
with a different layer.
The layers do get created and there is data in the folder:
ls -lah /var/lib/mesos/images/docker/layers/ |wc -l
68
du -sh /var/lib/mesos/images/docker/
5.5G /var/lib/mesos/images/docker/
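Note that the wc -l figure above is not an exact layer count: ls -lah prints a "total" line plus the . and .. entries, so 68 lines correspond to 65 directory entries. A small sketch that counts the layer directories directly, assuming each layer is a subdirectory of the layers directory (the helper names and the demo are illustrative only):

```python
import os
import tempfile


def entries_from_ls_lah(line_count: int) -> int:
    """Convert an `ls -lah | wc -l` line count into a count of real entries."""
    # Subtract the "total" header line and the "." and ".." entries.
    return max(line_count - 3, 0)


def count_layer_dirs(layers_dir: str) -> int:
    """Count immediate subdirectories of `layers_dir`, ignoring files and symlinks."""
    with os.scandir(layers_dir) as entries:
        return sum(1 for entry in entries if entry.is_dir(follow_symlinks=False))


def demo() -> int:
    """Build a throwaway directory with three fake layers and one file, then count."""
    with tempfile.TemporaryDirectory() as root:
        for name in ("layer-a", "layer-b", "layer-c"):
            os.mkdir(os.path.join(root, name))
        with open(os.path.join(root, "not-a-layer.txt"), "w") as handle:
            handle.write("ignored")
        return count_layer_dirs(root)
```

On the agent, count_layer_dirs("/var/lib/mesos/images/docker/layers") would give the number of layer directories without the listing overhead.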
What I tried:
- Setting sandbox_directory (to /mnt/mesos/sandbox, the same as the default),
which did not help.
- Setting docker_store_dir, which did not help either.
- Configuring Docker with overlayfs (backing fs: xfs) and starting a container;
this works. No messages, no errors.
While searching for this error I came across MESOS-6001, which is still open.
Since you are advising aufs as a solution in a situation where I am running
into a large layer or image size issue, can that issue be closed, or should I
ignore aufs as a suggestion?
I will try to replicate this in a Vagrant image tomorrow to check whether the
issue also occurs on the Apache Aurora Ubuntu Vagrant image.
> Large docker images make the mesos containerizer crash with: Too many levels
> of symbolic links
> ----------------------------------------------------------------------------------------------
>
> Key: MESOS-6327
> URL: https://issues.apache.org/jira/browse/MESOS-6327
> Project: Mesos
> Issue Type: Bug
> Components: containerization, docker
> Affects Versions: 1.0.0, 1.0.1
> Environment: centos 7.2 (1511), ubuntu 14.04 (trusty). Replicated in
> the Apache Aurora vagrant image
> Reporter: Rogier Dikkes
> Priority: Critical
>
> When deploying Mesos containers with large (6G+, 60+ layers) Docker images
> the task crashes with the error:
> Mesos agent logs:
> E1007 08:40:12.954227 8117 slave.cpp:3976] Container
> 'a1d759ae-5bc6-4c4e-ac03-717fbb8e5da4' for executor
> 'thermos-www-data-devel-hello_docker_image-0-d42d2af6-6b44-4b2b-be95-e1ba93a6b365'
> of framework dfc91a86-84b9-4539-a7be-4ace7b7b44a1-0000 failed to start: Collect failed:
> Collect failed: Failed to copy layer: cp: cannot stat
> ‘/var/lib/mesos/provisioner/containers/a1d759ae-5bc6-4c4e-ac03-717fbb8e5da4/backends/copy/rootfses/5f328f72-25d4-4a26-ac83-8d30bbc44e97/usr/share/zoneinfo/right/Asia/Urumqi’:
> Too many levels of symbolic links
> ... (complete pastebin: http://pastebin.com/umZ4Q5d1 )
> How to replicate:
> Start the Aurora vagrant image. Set
> /etc/mesos-slave/executor_registration_timeout to 5 mins. Edit the file
> /vagrant/examples/jobs/hello_docker_image.aurora to start a large Docker
> image instead of the example (you can use anldisr/jupyter:0.4, a test image
> I created that is based on the Jupyter notebook stacks). Create the job and
> watch it fail after a number of minutes.
> The mesos sandbox is empty.
> Aurora errors I see:
> 28 minutes ago - FAILED : Failed to launch container: Collect failed: Collect
> failed: Failed to copy layer: cp: cannot stat
> ‘/var/lib/mesos/provisioner/containers/93420a36-0e0c-4f04-b401-74c426c25686/backends/copy/rootfses/6e185a51-7174-4b0d-a305-42b634eb91bb/usr/share/zoneinfo/right/Asia/Urumqi’:
> Too many levels of symbolic links cp: cannot stat ...
> Too many levels of symbolic links ; Container destroyed while provisioning
> images
> (complete pastebin: http://pastebin.com/uecHYD5J )
> To rule out the image, I started this and other images as normal Docker
> containers. This works without issues.
> Related Mesos flags configured:
> -appc_store_dir /tmp/mesos/images/appc
> -containerizers docker,mesos
> -executor_registration_timeout 5mins
> -image_providers appc,docker
> -image_provisioner_backend copy
> -isolation filesystem/linux,docker/runtime
> Affected Mesos versions tested: 1.0.1 & 1.0.0