[ 
https://issues.apache.org/jira/browse/MESOS-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16352833#comment-16352833
 ] 

Till Toenshoff commented on MESOS-8546:
---------------------------------------

* why does the cache fail to write?
 ** because Python's egg caching does not cope well with parallel processes 
trying to cache the very same egg.
 * why did we never see this before?
 ** we used to bind both the scheduler and the executor egg into one 
module. Once the scheduler was loaded, the caching was solved. Now that the 
executor has its own egg, the problem shows up because the framework 
launches multiple tasks, which in turn launch executors that each try to cache 
their driver egg. 
 * how can we solve this?
 ** the individual tasks should not mutate host machine state. Making 
sure that every task gets its own {{PYTHON_EGG_CACHE}} within the 
{{MESOS_SANDBOX}} therefore seems to be a proper solution: it prevents 
concurrency problems on that end and also makes sure the state can get 
properly GCed.
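
A minimal sketch of the per-task cache idea above, not the actual patch. The helper name {{configure_egg_cache}}, the {{.python-eggs}} subdirectory name and the temporary-directory fallback are assumptions made for illustration; only {{PYTHON_EGG_CACHE}} and {{MESOS_SANDBOX}} come from the discussion.

```python
import os
import tempfile


def configure_egg_cache(environ=None, subdir=".python-eggs"):
    """Point PYTHON_EGG_CACHE at a directory inside the task sandbox.

    Hypothetical helper: returns the cache path that was configured.
    """
    environ = os.environ if environ is None else environ

    # Assumption: MESOS_SANDBOX is present in the executor environment.
    # Fall back to a fresh temporary directory when it is not, for
    # example when the script is run by hand outside of Mesos.
    sandbox = environ.get("MESOS_SANDBOX") or tempfile.mkdtemp(prefix="sandbox-")

    # Each task has its own sandbox, so no other process should be
    # creating this directory at the same time.
    cache = os.path.join(sandbox, subdir)
    if not os.path.isdir(cache):
        os.makedirs(cache)

    environ["PYTHON_EGG_CACHE"] = cache
    return cache
```

For this to have any effect, the call would have to run before the first import that triggers egg extraction, so in {{test_executor.py}} it would go ahead of {{from mesos.executor import MesosExecutorDriver}}. Setting the variable in the environment that the framework passes to the executor would be an alternative that needs no change to the executor script.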

> PythonFramework test fails with cache write failure.
> ----------------------------------------------------
>
>                 Key: MESOS-8546
>                 URL: https://issues.apache.org/jira/browse/MESOS-8546
>             Project: Mesos
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 1.6.0
>            Reporter: Till Toenshoff
>            Assignee: Till Toenshoff
>            Priority: Major
>              Labels: flaky, flaky-test, mesosphere
>
> After some recent changes, the {{ExamplesTest.PythonFramework}} test fails on 
> CentOS and Ubuntu rather frequently (but not always).
> The symptom is always like this (taken from an ASF CI run): 
> {noformat}
> [...]
> I0203 03:21:06.871362 11001 leveldb.cpp:347] Persisting action (16 bytes) to 
> leveldb took 73.84466ms
> I0203 03:21:06.871433 11001 replica.cpp:712] Persisted action TRUNCATE at 
> position 8
> I0203 03:21:06.871841 10984 replica.cpp:695] Replica received learned notice 
> for position 8 from log-network(1)@172.17.0.4:43102
> I0203 03:21:06.908581 11004 hierarchical.cpp:2429] Filtered offer with 
> ports:[31000-32000]; mem:9984; disk:367463 on agent 
> 0bd8b628-491d-46a1-a358-6cc902ee2578-S1 for role * of framework 
> 0bd8b628-491d-46a1-a358-6cc902ee2578-0000
> I0203 03:21:06.908924 11004 hierarchical.cpp:2429] Filtered offer with 
> cpus:1; mem:10112; disk:367463; ports:[31000-32000] on agent 
> 0bd8b628-491d-46a1-a358-6cc902ee2578-S2 for role * of framework 
> 0bd8b628-491d-46a1-a358-6cc902ee2578-0000
> I0203 03:21:06.909207 11004 hierarchical.cpp:2429] Filtered offer with 
> ports:[31000-32000]; mem:9984; disk:367463 on agent 
> 0bd8b628-491d-46a1-a358-6cc902ee2578-S0 for role * of framework 
> 0bd8b628-491d-46a1-a358-6cc902ee2578-0000
> I0203 03:21:06.909306 11004 hierarchical.cpp:1517] Performed allocation for 3 
> agents in 1.276217ms
> I0203 03:21:06.945303 10984 leveldb.cpp:347] Persisting action (18 bytes) to 
> leveldb took 73.445285ms
> I0203 03:21:06.945451 10984 leveldb.cpp:423] Deleting ~2 keys from leveldb 
> took 81868ns
> I0203 03:21:06.945477 10984 replica.cpp:712] Persisted action TRUNCATE at 
> position 8
> Traceback (most recent call last):
> File "/mesos/mesos-1.6.0/_build/../src/examples/python/test_executor.py", 
> line 25, in <module>
> from mesos.executor import MesosExecutorDriver
> File "build/bdist.linux-x86_64/egg/mesos/executor/__init__.py", line 17, in 
> <module>
> File "build/bdist.linux-x86_64/egg/mesos/executor/_executor.py", line 7, in 
> <module>
> File "build/bdist.linux-x86_64/egg/mesos/executor/_executor.py", line 4, in 
> __bootstrap__
> File 
> "/mesos/mesos-1.6.0/_build/3rdparty/setuptools-20.9.0/pkg_resources/__init__.py",
>  line 1172, in resource_filename
> self, resource_name
> File 
> "/mesos/mesos-1.6.0/_build/3rdparty/setuptools-20.9.0/pkg_resources/__init__.py",
>  line 1716, in get_resource_filename
> self._extract_resource(manager, self._eager_to_zip(name))
> File 
> "/mesos/mesos-1.6.0/_build/3rdparty/setuptools-20.9.0/pkg_resources/__init__.py",
>  line 1746, in _extract_resource
> self.egg_name, self._parts(zip_path)
> File 
> "/mesos/mesos-1.6.0/_build/3rdparty/setuptools-20.9.0/pkg_resources/__init__.py",
>  line 1239, in get_cache_path
> self.extraction_error()
> File 
> "/mesos/mesos-1.6.0/_build/3rdparty/setuptools-20.9.0/pkg_resources/__init__.py",
>  line 1219, in extraction_error
> raise err
> pkg_resources.ExtractionError: Can't extract file(s) to egg cache
> The following error occurred while trying to extract file(s) to the Python egg
> cache:
> [Errno 17] File exists: 
> '/home/mesos/.python-eggs/mesos.executor-1.6.0-py2.7-linux-x86_64.egg-tmp'
> The Python egg cache directory is currently set to:
> /home/mesos/.python-eggs
> Perhaps your account does not have write access to this directory? You can
> change the cache directory by setting the PYTHON_EGG_CACHE environment
> variable to point to an accessible directory.{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
