[jira] [Commented] (MESOS-3821) DOCKER_HOST does not work well with --executor_environment_variables
[ https://issues.apache.org/jira/browse/MESOS-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122782#comment-16122782 ] Huitse Tai commented on MESOS-3821: --- hi guys, will this issue be solved or not?! > DOCKER_HOST does not work well with --executor_environment_variables > > > Key: MESOS-3821 > URL: https://issues.apache.org/jira/browse/MESOS-3821 > Project: Mesos > Issue Type: Bug > Components: docker >Affects Versions: 0.25.0 > Environment: Docker 1.7.1 > Mesos 0.25.0 >Reporter: Lei Xu >Assignee: haosdent > > Hi guys, > I found that DOCKER_HOST does not work now if I set > bq. --executor_environment_variables={"DOCKER_HOST":"localhost:2377"} > but the docker executor always appends > bq. -H unix:///var/run/docker.sock > to each command, which in fact overwrites DOCKER_HOST. > I think it is too strict now, and I could not disable it via some command > flags. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
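For readers hitting this: the docker CLI resolves its daemon endpoint by a simple precedence, where an explicit {{-H}} flag beats the {{DOCKER_HOST}} environment variable, which beats the built-in default unix socket. Because the executor always appends {{-H unix:///var/run/docker.sock}}, the environment variable never takes effect. A minimal sketch simulating that precedence (an illustration, not the docker CLI's actual code):

```shell
# Simulate the docker CLI's endpoint resolution:
# explicit -H flag > DOCKER_HOST env var > built-in default socket.
resolve_docker_host() {
  flag_host="$1"   # value passed via -H; empty if the flag is absent
  if [ -n "$flag_host" ]; then
    echo "$flag_host"
  elif [ -n "$DOCKER_HOST" ]; then
    echo "$DOCKER_HOST"
  else
    echo "unix:///var/run/docker.sock"
  fi
}

# The executor always appends -H, so the env var loses:
DOCKER_HOST="localhost:2377"
export DOCKER_HOST
resolve_docker_host "unix:///var/run/docker.sock"
# -> unix:///var/run/docker.sock, not localhost:2377
```

This is why setting DOCKER_HOST via --executor_environment_variables has no visible effect: the appended flag sits higher in the precedence chain.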
[jira] [Updated] (MESOS-7880) Add an option to skip the Mesos style check when applying a review chain.
[ https://issues.apache.org/jira/browse/MESOS-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun-Hung Hsiao updated MESOS-7880: --- Sprint: Mesosphere Sprint 61 > Add an option to skip the Mesos style check when applying a review chain. > - > > Key: MESOS-7880 > URL: https://issues.apache.org/jira/browse/MESOS-7880 > Project: Mesos > Issue Type: Improvement >Reporter: Chun-Hung Hsiao >Assignee: Chun-Hung Hsiao >Priority: Minor > > The following pre-commit hook would prevent us from committing a patch file > for a third-party library that violates the Mesos style guide, and would cause > {{support/apply-reviews.py}} to fail. > https://github.com/apache/mesos/blob/master/support/hooks/pre-commit#L24 > As a workaround, we could add a new option to skip the pre-commit hook. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7809) Building gRPC with Autotools
[ https://issues.apache.org/jira/browse/MESOS-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun-Hung Hsiao updated MESOS-7809: --- Labels: storage (was: ) > Building gRPC with Autotools > > > Key: MESOS-7809 > URL: https://issues.apache.org/jira/browse/MESOS-7809 > Project: Mesos > Issue Type: Improvement >Reporter: Chun-Hung Hsiao >Assignee: Chun-Hung Hsiao > Labels: storage > > grpc does not come with an autotools script and has a hand-written makefile > which assumes certain libraries are pre-installed on the system. We need to write > proper rules in our autotools configuration that override the default path options > in grpc's Makefile in order to support building grpc with autotools. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
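The underlying mechanism here is plain make semantics: a variable assigned on the make command line overrides an assignment inside a hand-written Makefile, which is how an autotools wrapper can redirect a sub-build's default paths. A toy demonstration (the {{prefix}} variable and paths are illustrative, not grpc's actual Makefile variables):

```shell
# Stand-in for a hand-written Makefile with a baked-in default path.
# `prefix` is illustrative; grpc's real variable names may differ.
mk=$(mktemp)
printf 'prefix = /usr/local\nshow:\n\t@echo "installing under $(prefix)"\n' > "$mk"

# Default path from the Makefile itself:
make -s -f "$mk" show
# -> installing under /usr/local

# A command-line assignment overrides the Makefile's default,
# which is how a wrapping build system can redirect install paths:
make -s -f "$mk" show prefix=/opt/mesos/3rdparty
# -> installing under /opt/mesos/3rdparty
```

The "proper rules" the ticket describes amount to generating such overrides from the autotools configuration instead of trusting the Makefile's hard-coded defaults.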
[jira] [Updated] (MESOS-7870) Refactor libssl and libcrypto checks for building gRPC
[ https://issues.apache.org/jira/browse/MESOS-7870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun-Hung Hsiao updated MESOS-7870: --- Labels: storage (was: ) > Refactor libssl and libcrypto checks for building gRPC > -- > > Key: MESOS-7870 > URL: https://issues.apache.org/jira/browse/MESOS-7870 > Project: Mesos > Issue Type: Improvement >Reporter: Chun-Hung Hsiao >Assignee: Chun-Hung Hsiao > Labels: storage > Fix For: 1.5.0 > > > Refactor the library checks for OpenSSL such that they are decoupled from the > `--enable-ssl` flag, due to the dependency between OpenSSL and gRPC. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7810) gRPC support in libprocess
[ https://issues.apache.org/jira/browse/MESOS-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun-Hung Hsiao updated MESOS-7810: --- Labels: storage (was: ) > gRPC support in libprocess > -- > > Key: MESOS-7810 > URL: https://issues.apache.org/jira/browse/MESOS-7810 > Project: Mesos > Issue Type: Improvement >Reporter: Chun-Hung Hsiao >Assignee: Chun-Hung Hsiao > Labels: storage > > We would like to introduce a grpc wrapper in libprocess. The wrapper provides > a clean interface for gRPC asynchronous calls and returns a {{Future}}, so > others can easily use actor-based programming with libprocess to support grpc > communications. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7808) Bundling gRPC into 3rdparty
[ https://issues.apache.org/jira/browse/MESOS-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun-Hung Hsiao updated MESOS-7808: --- Labels: storage (was: ) > Bundling gRPC into 3rdparty > --- > > Key: MESOS-7808 > URL: https://issues.apache.org/jira/browse/MESOS-7808 > Project: Mesos > Issue Type: Improvement >Reporter: Chun-Hung Hsiao >Assignee: Chun-Hung Hsiao > Labels: storage > Fix For: 1.4.0 > > > grpc comes with a hand-written makefile and cmake file, but no autotools > configuration scripts. As a first step to support grpc in Mesos, we could > integrate gRPC into our cmake build process under Linux, and make it a > dependency of libprocess. Since it also depends on protobuf, this will create > a triangular dependency between protobuf, grpc and libprocess, so the > existing build configurations need to be adjusted as well. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-7881) Building gRPC with CMake
Chun-Hung Hsiao created MESOS-7881: -- Summary: Building gRPC with CMake Key: MESOS-7881 URL: https://issues.apache.org/jira/browse/MESOS-7881 Project: Mesos Issue Type: Improvement Reporter: Chun-Hung Hsiao Assignee: Chun-Hung Hsiao Fix For: 1.4.0 gRPC manages its own third-party libraries, which overlap with Mesos' third-party library bundles. We need to write proper CMake rules that configure gRPC's own CMake build correctly. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7808) Bundling gRPC into 3rdparty
[ https://issues.apache.org/jira/browse/MESOS-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun-Hung Hsiao updated MESOS-7808: --- Summary: Bundling gRPC into 3rdparty (was: Bundling gRPC into 3rdparty with CMake under Linux) > Bundling gRPC into 3rdparty > --- > > Key: MESOS-7808 > URL: https://issues.apache.org/jira/browse/MESOS-7808 > Project: Mesos > Issue Type: Improvement >Reporter: Chun-Hung Hsiao >Assignee: Chun-Hung Hsiao > > grpc comes with a hand-written makefile and cmake file, but no autotools > configuration scripts. As a first step to support grpc in Mesos, we could > integrate gRPC into our cmake build process under Linux, and make it a > dependency of libprocess. Since it also depends on protobuf, this will create > a triangular dependency between protobuf, grpc and libprocess, so the > existing build configurations need to be adjusted as well. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Issue Comment Deleted] (MESOS-7869) Build fails with `--disable-zlib` or `--with-zlib=DIR`
[ https://issues.apache.org/jira/browse/MESOS-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun-Hung Hsiao updated MESOS-7869: --- Comment: was deleted (was: {noformat} commit 30914ea9445e2ec3eb48e2daad814accca8f404c Author: Chun-Hung Hsiao Date: Tue Aug 8 13:39:33 2017 -0700 Removed `--disable-zlib` and fixed `--with-zlib` for Mesos. For third-party libraries that do not support `--with-zlib=DIR`, we introduce new variables `ZLIB_CPPFLAGS` and `ZLIB_LINKERFLAGS` so that they can be used to set up `CPPFLAGS` and `LDFLAGS` when building those libraries. Review: https://reviews.apache.org/r/61508 {noformat} {noformat} commit 7a385a464fe1b76bbd7b3009d8f043fbe0eff6f9 Author: Chun-Hung Hsiao Date: Tue Aug 8 13:44:11 2017 -0700 Removed `--disable-zlib` and fixed `--with-zlib` for libprocess. Added `--with-zlib` for specifying a custom zlib path. For third-party libraries that do not support `--with-zlib=DIR`, we introduce new variables `ZLIB_CPPFLAGS` and `ZLIB_LINKERFLAGS` so that they can be used to set up `CPPFLAGS` and `LDFLAGS` when building those libraries. Review: https://reviews.apache.org/r/61509 {noformat}) > Build fails with `--disable-zlib` or `--with-zlib=DIR` > -- > > Key: MESOS-7869 > URL: https://issues.apache.org/jira/browse/MESOS-7869 > Project: Mesos > Issue Type: Bug >Reporter: Chun-Hung Hsiao >Assignee: Chun-Hung Hsiao > Fix For: 1.4.0 > > > ZLib has been a required library for Mesos and libprocess, and > {{--disable-zlib}} no longer works, so it should be removed. > Also, when {{--with-zlib=DIR}} is specified, the protobuf build will fail > because it does not support specifying a customized zlib path through > {{--with-zlib}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7869) Build fails with `--disable-zlib` or `--with-zlib=DIR`
[ https://issues.apache.org/jira/browse/MESOS-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun-Hung Hsiao updated MESOS-7869: --- {noformat} commit 30914ea9445e2ec3eb48e2daad814accca8f404c Author: Chun-Hung Hsiao Date: Tue Aug 8 13:39:33 2017 -0700 Removed `--disable-zlib` and fixed `--with-zlib` for Mesos. For third-party libraries that do not support `--with-zlib=DIR`, we introduce new variables `ZLIB_CPPFLAGS` and `ZLIB_LINKERFLAGS` so that they can be used to set up `CPPFLAGS` and `LDFLAGS` when building those libraries. Review: https://reviews.apache.org/r/61508 {noformat} {noformat} commit 7a385a464fe1b76bbd7b3009d8f043fbe0eff6f9 Author: Chun-Hung Hsiao Date: Tue Aug 8 13:44:11 2017 -0700 Removed `--disable-zlib` and fixed `--with-zlib` for libprocess. Added `--with-zlib` for specifying a custom zlib path. For third-party libraries that do not support `--with-zlib=DIR`, we introduce new variables `ZLIB_CPPFLAGS` and `ZLIB_LINKERFLAGS` so that they can be used to set up `CPPFLAGS` and `LDFLAGS` when building those libraries. Review: https://reviews.apache.org/r/61509 {noformat} > Build fails with `--disable-zlib` or `--with-zlib=DIR` > -- > > Key: MESOS-7869 > URL: https://issues.apache.org/jira/browse/MESOS-7869 > Project: Mesos > Issue Type: Bug >Reporter: Chun-Hung Hsiao >Assignee: Chun-Hung Hsiao > Fix For: 1.4.0 > > > ZLib has been a required library for Mesos and libprocess, and > {{--disable-zlib}} no longer works, so it should be removed. > Also, when {{--with-zlib=DIR}} is specified, the protobuf build will fail > because it does not support specifying a customized zlib path through > {{--with-zlib}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
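The commit messages above name {{ZLIB_CPPFLAGS}} and {{ZLIB_LINKERFLAGS}}; conceptually, a user-supplied {{--with-zlib=DIR}} is translated into include and library search paths that can then be forwarded as {{CPPFLAGS}}/{{LDFLAGS}} to sub-builds (such as protobuf) that lack {{--with-zlib}} support. A hedged sketch of that translation (the actual configure logic in Mesos may differ; the directory is a stand-in):

```shell
# Given --with-zlib=DIR, derive flags to forward to sub-builds that
# do not understand --with-zlib themselves.
zlib_dir=/opt/zlib   # stand-in for the DIR from --with-zlib=DIR

ZLIB_CPPFLAGS="-I${zlib_dir}/include"
ZLIB_LINKERFLAGS="-L${zlib_dir}/lib"

# These would then be passed along when configuring e.g. protobuf:
echo "CPPFLAGS=${ZLIB_CPPFLAGS} LDFLAGS=${ZLIB_LINKERFLAGS}"
# -> CPPFLAGS=-I/opt/zlib/include LDFLAGS=-L/opt/zlib/lib
```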
[jira] [Updated] (MESOS-7880) Add an option to skip the Mesos style check when applying a review chain.
[ https://issues.apache.org/jira/browse/MESOS-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun-Hung Hsiao updated MESOS-7880: --- Story Points: 1 > Add an option to skip the Mesos style check when applying a review chain. > - > > Key: MESOS-7880 > URL: https://issues.apache.org/jira/browse/MESOS-7880 > Project: Mesos > Issue Type: Improvement >Reporter: Chun-Hung Hsiao >Assignee: Chun-Hung Hsiao >Priority: Minor > > The following pre-commit hook would prevent us from committing a patch file > for a third-party library that violates the Mesos style guide, and would cause > {{support/apply-reviews.py}} to fail. > https://github.com/apache/mesos/blob/master/support/hooks/pre-commit#L24 > As a workaround, we could add a new option to skip the pre-commit hook. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-7880) Add an option to skip the Mesos style check when applying a review chain.
Chun-Hung Hsiao created MESOS-7880: -- Summary: Add an option to skip the Mesos style check when applying a review chain. Key: MESOS-7880 URL: https://issues.apache.org/jira/browse/MESOS-7880 Project: Mesos Issue Type: Improvement Reporter: Chun-Hung Hsiao Assignee: Chun-Hung Hsiao Priority: Minor The following pre-commit hook would prevent us from committing a patch file for a third-party library that violates the Mesos style guide, and would cause {{support/apply-reviews.py}} to fail. https://github.com/apache/mesos/blob/master/support/hooks/pre-commit#L24 As a workaround, we could add a new option to skip the pre-commit hook. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
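For what it's worth, git itself already provides the escape hatch such an option would likely wrap: {{git commit --no-verify}} skips the pre-commit hook entirely. A throwaway-repo demonstration (paths and messages are made up):

```shell
# A throwaway repo with a pre-commit hook that always rejects,
# standing in for the Mesos style check.
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" config user.email demo@example.com
git -C "$repo" config user.name demo
printf '#!/bin/sh\nexit 1\n' > "$repo/.git/hooks/pre-commit"
chmod +x "$repo/.git/hooks/pre-commit"

echo hello > "$repo/file.txt"
git -C "$repo" add file.txt

# A normal commit is blocked by the hook...
git -C "$repo" commit -q -m "blocked" || echo "rejected by hook"
# ...while --no-verify bypasses the hook entirely:
git -C "$repo" commit -q --no-verify -m "bypassed" && echo "committed"
```

An option in {{support/apply-reviews.py}} would presumably just pass this flag (or an equivalent) when committing patches for third-party code.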
[jira] [Updated] (MESOS-6743) Docker executor hangs forever if `docker stop` fails.
[ https://issues.apache.org/jira/browse/MESOS-6743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-6743: --- Fix Version/s: 1.1.3 {noformat} Commit: 4d2afc50c88afff1c197720fa507637def4d2f20 [4d2afc5] Author: Andrei Budnik abud...@mesosphere.com Date: 10 August 2017 at 18:52:51 GMT+2 Committer: Alexander Rukletsov al...@apache.org Commit Date: 10 August 2017 at 22:46:35 GMT+2 Added logging in docker executor on docker stop failure. Review: https://reviews.apache.org/r/61435/ {noformat} {noformat} Commit: 06dcbd7b7c876a1f90934a679e2514d012df4d37 [06dcbd7] Author: Andrei Budnik abud...@mesosphere.com Date: 10 August 2017 at 18:53:03 GMT+2 Committer: Alexander Rukletsov al...@apache.org Commit Date: 10 August 2017 at 22:46:35 GMT+2 Enabled retries for killTask in docker executor. Previously, after docker stop command failure, docker executor neither allowed a scheduler to retry killTask command, nor retried killTask when task kill was triggered by a failed health check. Review: https://reviews.apache.org/r/61530/ {noformat} > Docker executor hangs forever if `docker stop` fails. > - > > Key: MESOS-6743 > URL: https://issues.apache.org/jira/browse/MESOS-6743 > Project: Mesos > Issue Type: Bug > Components: docker >Affects Versions: 1.0.1, 1.1.0, 1.2.1, 1.3.0 >Reporter: Alexander Rukletsov >Assignee: Andrei Budnik >Priority: Critical > Labels: mesosphere, reliability > Fix For: 1.1.3 > > > If {{docker stop}} finishes with an error status, the executor should catch > this and react instead of indefinitely waiting for {{reaped}} to return. > An interesting question is _how_ to react. Here are possible solutions. > 1. Retry {{docker stop}}. In this case it is unclear how many times to retry > and what to do if {{docker stop}} continues to fail. > 2. Unmark task as {{killed}}. This will allow frameworks to retry the kill. 
> However, in this case it is unclear what status updates we should send: > {{TASK_KILLING}} for every kill retry? an extra update when we failed to kill > a task? or set a specific reason in {{TASK_KILLING}}? > 3. Clean up and exit. In this case we should make sure the task container is > killed or notify the framework and the operator that the container may still > be running. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
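Option 1's open question (how many times to retry, and what to do when retries run out) is the standard bounded-retry-with-backoff problem. A generic shell sketch of the pattern, standing in for retrying {{docker stop}} (the executor itself is C++; this is only an illustration):

```shell
# Bounded retry with linear backoff, standing in for retrying
# `docker stop` a limited number of times before giving up.
retry() {
  max_attempts="$1"; shift
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$attempt"          # back off a little longer each time
    attempt=$((attempt + 1))
  done
}

# Example: a command that fails twice, then succeeds.
counter_file=$(mktemp)
flaky() {
  n=$(cat "$counter_file"); n=${n:-0}
  n=$((n + 1)); echo "$n" > "$counter_file"
  [ "$n" -ge 3 ]
}
retry 5 flaky && echo "stopped"
```

The hard part the ticket raises is not the loop but the exhaustion branch: once `retry` gives up, the executor still has to pick between options 2 and 3 above.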
[jira] [Comment Edited] (MESOS-4969) improve overlayfs detection
[ https://issues.apache.org/jira/browse/MESOS-4969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122375#comment-16122375 ] Aaron Wood edited comment on MESOS-4969 at 8/10/17 9:34 PM: Sorry, I thought Mesos was only looking at {{/proc/modules}}. was (Author: aaron.wood): Sorry, I thought Mesos was only looking at {noformat}/proc/modules{noformat}. > improve overlayfs detection > --- > > Key: MESOS-4969 > URL: https://issues.apache.org/jira/browse/MESOS-4969 > Project: Mesos > Issue Type: Bug > Components: containerization, storage >Reporter: James Peach >Priority: Minor > > On my Fedora 23, overlayfs is a module that is not loaded by default > (attempting to mount an overlayfs automatically triggers the module loading). > However {{mesos-slave}} won't start until I manually load the module, since it > is not listed in {{/proc/filesystems}} until it is loaded. > It would be nice if there was a more reliable way to determine overlayfs > support. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-4969) improve overlayfs detection
[ https://issues.apache.org/jira/browse/MESOS-4969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122375#comment-16122375 ] Aaron Wood commented on MESOS-4969: --- Sorry, I thought Mesos was only looking at {noformat}/proc/modules{noformat}. > improve overlayfs detection > --- > > Key: MESOS-4969 > URL: https://issues.apache.org/jira/browse/MESOS-4969 > Project: Mesos > Issue Type: Bug > Components: containerization, storage >Reporter: James Peach >Priority: Minor > > On my Fedora 23, overlayfs is a module that is not loaded by default > (attempting to mount an overlayfs automatically triggers the module loading). > However {{mesos-slave}} won't start until I manually load the module, since it > is not listed in {{/proc/filesystems}} until it is loaded. > It would be nice if there was a more reliable way to determine overlayfs > support. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
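The detection being discussed amounts to parsing {{/proc/filesystems}}, whose lines carry the filesystem name in the last field (with an optional {{nodev}} in the first). A small sketch of that check, driven by a sample file so it does not depend on the running kernel; loading the module (e.g. via {{modprobe overlay}}) and re-checking reflects the reporter's manual workaround:

```shell
# Check whether a filesystem appears in a /proc/filesystems-style
# listing (fs name is the last field; "nodev" may occupy the first).
fs_supported() {
  fs="$1"; listing="$2"
  awk -v fs="$fs" '$NF == fs { found = 1 } END { exit !found }' "$listing"
}

# Sample input mimicking /proc/filesystems on a kernel where the
# overlay module has not been loaded yet:
sample=$(mktemp)
printf 'nodev\tsysfs\nnodev\tproc\n\text4\n' > "$sample"

fs_supported overlay "$sample" && echo "overlay listed" || echo "overlay not listed"
# -> overlay not listed
# A more reliable probe would load the module (modprobe overlay)
# or attempt a test mount, then re-check the listing.
```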
[jira] [Commented] (MESOS-1719) Master should persist active frameworks information
[ https://issues.apache.org/jira/browse/MESOS-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122328#comment-16122328 ] Yan Xu commented on MESOS-1719: --- [~adam-mesos] does this being labelled {{mesosphere}} mean this is on your roadmap in the near to medium term? > Master should persist active frameworks information > --- > > Key: MESOS-1719 > URL: https://issues.apache.org/jira/browse/MESOS-1719 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Vinod Kone >Assignee: Yongqiao Wang > Labels: mesosphere, reliability > > https://issues.apache.org/jira/browse/MESOS-1219 disallows completed > frameworks from re-registering with the same framework id, as long as the > master doesn't fail over. > This ticket tracks the work to make this work across master failover using > the registrar. > There are some open questions that need to be addressed: > --> Should the registry contain framework ids only, or framework infos too? > For disallowing completed frameworks from re-registering, persisting > framework ids is enough. But, if in the future, we want to disallow > frameworks from re-registering if some parts of framework info > changed, then we need to persist the info too. > --> How to update the framework info. > Currently frameworks are allowed to update framework info while > re-registering, but it only takes effect on the master when the master > fails over and on the slave when the slave fails over. How should things > change when we persist framework info? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7872) Scheduler hang when registration fails (due to bad role)
[ https://issues.apache.org/jira/browse/MESOS-7872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-7872: --- Sprint: Mesosphere Sprint 61 Story Points: 3 > Scheduler hang when registration fails (due to bad role) > > > Key: MESOS-7872 > URL: https://issues.apache.org/jira/browse/MESOS-7872 > Project: Mesos > Issue Type: Bug > Components: scheduler driver >Affects Versions: 1.4.0 >Reporter: Till Toenshoff >Assignee: Alexander Rukletsov > Labels: framework, reliability, scheduler > > I'm finding that if framework registration fails, the mesos driver client > will hang indefinitely with the following output: > {noformat} > I0809 20:04:22.47939173 sched.cpp:1187] Got error ''FrameworkInfo.role' > is not a valid role: Role '/test/role/slashes' cannot start with a slash' > I0809 20:04:22.47965873 sched.cpp:2055] Asked to abort the driver > I0809 20:04:22.47984373 sched.cpp:1233] Aborting framework > {noformat} > I'd have expected one or both of the following: > - SchedulerDriver.run() should have exited with a failed Proto.Status of some > form > - Scheduler.error() should have been invoked when the "Got error" occurred > Steps to reproduce: > - Launch a scheduler instance, have it register with a known-bad framework > info. In this case a role containing slashes was used > - Observe that the scheduler continues in a TASK_RUNNING state despite the > failed registration. From all appearances it looks like the Scheduler > implementation isn't invoked at all > I'd guess that because this failure happens before framework registration, > there's some error handling that isn't fully initialized at this point. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7872) Scheduler hang when registration fails (due to bad role)
[ https://issues.apache.org/jira/browse/MESOS-7872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-7872: --- Shepherd: Anand Mazumdar > Scheduler hang when registration fails (due to bad role) > > > Key: MESOS-7872 > URL: https://issues.apache.org/jira/browse/MESOS-7872 > Project: Mesos > Issue Type: Bug > Components: scheduler driver >Affects Versions: 1.4.0 >Reporter: Till Toenshoff >Assignee: Alexander Rukletsov > Labels: framework, reliability, scheduler > > I'm finding that if framework registration fails, the mesos driver client > will hang indefinitely with the following output: > {noformat} > I0809 20:04:22.47939173 sched.cpp:1187] Got error ''FrameworkInfo.role' > is not a valid role: Role '/test/role/slashes' cannot start with a slash' > I0809 20:04:22.47965873 sched.cpp:2055] Asked to abort the driver > I0809 20:04:22.47984373 sched.cpp:1233] Aborting framework > {noformat} > I'd have expected one or both of the following: > - SchedulerDriver.run() should have exited with a failed Proto.Status of some > form > - Scheduler.error() should have been invoked when the "Got error" occurred > Steps to reproduce: > - Launch a scheduler instance, have it register with a known-bad framework > info. In this case a role containing slashes was used > - Observe that the scheduler continues in a TASK_RUNNING state despite the > failed registration. From all appearances it looks like the Scheduler > implementation isn't invoked at all > I'd guess that because this failure happens before framework registration, > there's some error handling that isn't fully initialized at this point. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (MESOS-7872) Scheduler hang when registration fails (due to bad role)
[ https://issues.apache.org/jira/browse/MESOS-7872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-7872: -- Assignee: Alexander Rukletsov > Scheduler hang when registration fails (due to bad role) > > > Key: MESOS-7872 > URL: https://issues.apache.org/jira/browse/MESOS-7872 > Project: Mesos > Issue Type: Bug > Components: scheduler driver >Affects Versions: 1.4.0 >Reporter: Till Toenshoff >Assignee: Alexander Rukletsov > Labels: framework, reliability, scheduler > > I'm finding that if framework registration fails, the mesos driver client > will hang indefinitely with the following output: > {noformat} > I0809 20:04:22.47939173 sched.cpp:1187] Got error ''FrameworkInfo.role' > is not a valid role: Role '/test/role/slashes' cannot start with a slash' > I0809 20:04:22.47965873 sched.cpp:2055] Asked to abort the driver > I0809 20:04:22.47984373 sched.cpp:1233] Aborting framework > {noformat} > I'd have expected one or both of the following: > - SchedulerDriver.run() should have exited with a failed Proto.Status of some > form > - Scheduler.error() should have been invoked when the "Got error" occurred > Steps to reproduce: > - Launch a scheduler instance, have it register with a known-bad framework > info. In this case a role containing slashes was used > - Observe that the scheduler continues in a TASK_RUNNING state despite the > failed registration. From all appearances it looks like the Scheduler > implementation isn't invoked at all > I'd guess that because this failure happens before framework registration, > there's some error handling that isn't fully initialized at this point. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-7879) The kill nested container call should provide ability to specify a signal.
Anand Mazumdar created MESOS-7879: - Summary: The kill nested container call should provide ability to specify a signal. Key: MESOS-7879 URL: https://issues.apache.org/jira/browse/MESOS-7879 Project: Mesos Issue Type: Task Reporter: Anand Mazumdar Assignee: Anand Mazumdar Currently, the {{KILL_NESTED_CONTAINER}} call only sends the SIGKILL signal to a running container. We should make the signal configurable and then make the default executor specify it, i.e., initially send SIGTERM followed by a SIGKILL. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
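The escalation described, SIGTERM first and SIGKILL after a grace period, is the classic graceful-kill pattern. A generic sketch (the grace period and process handling here are illustrative, not the default executor's implementation):

```shell
# Graceful-then-forceful kill: SIGTERM first, SIGKILL if the process
# is still alive after a grace period (in seconds).
graceful_kill() {
  pid="$1"; grace="$2"
  kill -TERM "$pid" 2>/dev/null
  waited=0
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$grace" ]; then
      kill -KILL "$pid" 2>/dev/null   # escalate: SIGKILL cannot be caught
      break
    fi
    sleep 1
    waited=$((waited + 1))
  done
}

# Example: a long-running sleep is terminated by the first SIGTERM.
sleep 1000 &
victim=$!
graceful_kill "$victim" 2
wait "$victim" 2>/dev/null
kill -0 "$victim" 2>/dev/null || echo "process gone"
```

Making the signal configurable in {{KILL_NESTED_CONTAINER}} would let the default executor drive exactly this sequence itself.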
[jira] [Issue Comment Deleted] (MESOS-7872) Scheduler hang when registration fails (due to bad role)
[ https://issues.apache.org/jira/browse/MESOS-7872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-7872: --- Comment: was deleted (was: The problem is likely in the HTTP adapter. [Java side of the adapter|https://github.com/mesosphere/mesos-http-adapter/blob/master/src/main/java/com/mesosphere/mesos/http-adapter/MesosToSchedulerDriverAdapter.java] sends a {{SUBSCRIBE}} request that never completes, due to an error. That error is transferred to the [C++ side of the adapter|https://github.com/apache/mesos/blob/364abfc1bed8543b984ebd3712047b5ed8a109d2/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp#L550], but is not transmitted to the java side, because {{SUBSCRIBED}} [has not succeeded|https://github.com/apache/mesos/blob/364abfc1bed8543b984ebd3712047b5ed8a109d2/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp#L699] yet! Deadlock. A fix here would be allowing {{ERROR}} events to go through even if the scheduler has not subscribed yet.) 
> Scheduler hang when registration fails (due to bad role) > > > Key: MESOS-7872 > URL: https://issues.apache.org/jira/browse/MESOS-7872 > Project: Mesos > Issue Type: Bug > Components: scheduler driver >Affects Versions: 1.4.0 >Reporter: Till Toenshoff > Labels: framework, reliability, scheduler > > I'm finding that if framework registration fails, the mesos driver client > will hang indefinitely with the following output: > {noformat} > I0809 20:04:22.47939173 sched.cpp:1187] Got error ''FrameworkInfo.role' > is not a valid role: Role '/test/role/slashes' cannot start with a slash' > I0809 20:04:22.47965873 sched.cpp:2055] Asked to abort the driver > I0809 20:04:22.47984373 sched.cpp:1233] Aborting framework > {noformat} > I'd have expected one or both of the following: > - SchedulerDriver.run() should have exited with a failed Proto.Status of some > form > - Scheduler.error() should have been invoked when the "Got error" occurred > Steps to reproduce: > - Launch a scheduler instance, have it register with a known-bad framework > info. In this case a role containing slashes was used > - Observe that the scheduler continues in a TASK_RUNNING state despite the > failed registration. From all appearances it looks like the Scheduler > implementation isn't invoked at all > I'd guess that because this failure happens before framework registration, > there's some error handling that isn't fully initialized at this point. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7872) Scheduler hang when registration fails (due to bad role)
[ https://issues.apache.org/jira/browse/MESOS-7872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122058#comment-16122058 ] Alexander Rukletsov commented on MESOS-7872: The problem is likely in the HTTP adapter. [Java side of the adapter|https://github.com/mesosphere/mesos-http-adapter/blob/master/src/main/java/com/mesosphere/mesos/http-adapter/MesosToSchedulerDriverAdapter.java] sends a {{SUBSCRIBE}} request that never completes, due to an error. That error is transferred to the [C++ side of the adapter|https://github.com/apache/mesos/blob/364abfc1bed8543b984ebd3712047b5ed8a109d2/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp#L550], but is not transmitted to the java side, because {{SUBSCRIBED}} [has not succeeded|https://github.com/apache/mesos/blob/364abfc1bed8543b984ebd3712047b5ed8a109d2/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp#L699] yet! Deadlock. A fix here would be allowing {{ERROR}} events to go through even if the scheduler has not subscribed yet. 
> Scheduler hang when registration fails (due to bad role) > > > Key: MESOS-7872 > URL: https://issues.apache.org/jira/browse/MESOS-7872 > Project: Mesos > Issue Type: Bug > Components: scheduler driver >Affects Versions: 1.4.0 >Reporter: Till Toenshoff > Labels: framework, reliability, scheduler > > I'm finding that if framework registration fails, the mesos driver client > will hang indefinitely with the following output: > {noformat} > I0809 20:04:22.47939173 sched.cpp:1187] Got error ''FrameworkInfo.role' > is not a valid role: Role '/test/role/slashes' cannot start with a slash' > I0809 20:04:22.47965873 sched.cpp:2055] Asked to abort the driver > I0809 20:04:22.47984373 sched.cpp:1233] Aborting framework > {noformat} > I'd have expected one or both of the following: > - SchedulerDriver.run() should have exited with a failed Proto.Status of some > form > - Scheduler.error() should have been invoked when the "Got error" occurred > Steps to reproduce: > - Launch a scheduler instance, have it register with a known-bad framework > info. In this case a role containing slashes was used > - Observe that the scheduler continues in a TASK_RUNNING state despite the > failed registration. From all appearances it looks like the Scheduler > implementation isn't invoked at all > I'd guess that because this failure happens before framework registration, > there's some error handling that isn't fully initialized at this point. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-6950) Launching two tasks with the same Docker image simultaneously may cause a staging dir never cleaned up
[ https://issues.apache.org/jira/browse/MESOS-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilbert Song updated MESOS-6950: Labels: containerizer mesosphere (was: ) > Launching two tasks with the same Docker image simultaneously may cause a > staging dir never cleaned up > -- > > Key: MESOS-6950 > URL: https://issues.apache.org/jira/browse/MESOS-6950 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Qian Zhang >Assignee: Qian Zhang > Labels: containerizer, mesosphere > > If a user launches two tasks with the same Docker image simultaneously (e.g., > runs {{mesos-executor}} twice with the same Docker image), there will be a > staging directory for the second task that is never cleaned up, like this: > {code} > └── store > └── docker > ├── layers > │... > ├── staging > │ └── a6rXWC > └── storedImages > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7874) Provide a consistent non-blocking preLaunch hook
[ https://issues.apache.org/jira/browse/MESOS-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li updated MESOS-7874: - Description: Our use case: we need a non-blocking way to notify our secret management system during the task launch sequence on the agent. This mechanism needs to work for both {{DockerContainerizer}} and {{MesosContainerizer}}, and for both a {{custom executor}} and the {{command executor}}, with proper access to labels on {{TaskInfo}}. As of 1.3.0, the hooks in [hook.hpp | https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] are pretty inconsistent across these combinations. The closest option is {{slavePreLaunchDockerTaskExecutorDecorator}}; however, it has a couple of problems: 1. For DockerContainerizer + custom executor, it strips away TaskInfo and sends a `None()` instead; 2. This hook is not called by {{MesosContainerizer}} at all. I guess that is because people can implement an {{isolator}} instead; however, that creates extra work for module authors and operators. The other option is {{slaveRunTaskLabelDecorator}}, but it has its own problems: 1. Errors are silently swallowed, so a module cannot stop the task launch sequence; 2. It is blocking, which means we cannot wait for the result of another subprocess or RPC. I'm inclined to fix the two problems in {{slavePreLaunchDockerTaskExecutorDecorator}}, but I'm open to other suggestions. was: Our use case: we need a non-blocking prelaunch hook to integrate with our own secret management system, and this hook needs to work under both {{DockerContainerizer}} and {{MesosContainerizer}}, for both a {{custom executor}} and the {{command executor}}, with proper access to {{TaskInfo}} (actually certain labels on it). As of 1.3.0, the hooks in [hook.hpp | https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] are pretty inconsistent across these combinations. The closest option is {{slavePreLaunchDockerTaskExecutorDecorator}}; however, it has a couple of problems: 1. For DockerContainerizer + custom executor, it strips away TaskInfo and sends a `None()` instead; 2. This hook is not called by {{MesosContainerizer}} at all. I guess that is because people can implement an {{isolator}} instead; however, that creates extra work for module authors and operators. The other option is {{slaveRunTaskLabelDecorator}}, but it has its own problems: 1. Errors are silently swallowed, so a module cannot stop the task launch sequence; 2. It is blocking, which means we cannot wait for the result of another subprocess or RPC. I'm inclined to fix the two problems in {{slavePreLaunchDockerTaskExecutorDecorator}}, but I'm open to other suggestions. > Provide a consistent non-blocking preLaunch hook > > > Key: MESOS-7874 > URL: https://issues.apache.org/jira/browse/MESOS-7874 > Project: Mesos > Issue Type: Improvement > Components: modules >Reporter: Zhitao Li >Assignee: Zhitao Li > Labels: hooks, module > > Our use case: we need a non-blocking way to notify our secret management > system during the task launch sequence on the agent. This mechanism needs to work > for both {{DockerContainerizer}} and {{MesosContainerizer}}, and for both > a {{custom executor}} and the {{command executor}}, with proper access to labels on > {{TaskInfo}}. > As of 1.3.0, the hooks in [hook.hpp | > https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] are pretty > inconsistent across these combinations. > The closest option is {{slavePreLaunchDockerTaskExecutorDecorator}}; > however, it has a couple of problems: > 1. For DockerContainerizer + custom executor, it strips away TaskInfo and > sends a `None()` instead; > 2. This hook is not called by {{MesosContainerizer}} at all. I guess that is > because people can implement an {{isolator}} instead; however, that creates extra work > for module authors and operators. > The other option is {{slaveRunTaskLabelDecorator}}, but it has its own problems: > 1. Errors are silently swallowed, so a module cannot stop the task launch > sequence; > 2. It is blocking, which means we cannot wait for the result of another > subprocess or RPC. > I'm inclined to fix the two problems in > {{slavePreLaunchDockerTaskExecutorDecorator}}, but I'm open to other suggestions. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
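The blocking-versus-non-blocking distinction above can be sketched in isolation. The following is a toy illustration, not Mesos code: {{ToyFuture}}, {{blockingDecorator}}, {{nonBlockingDecorator}}, and the {{secret-ref}} label are all invented names, and Mesos actually uses libprocess's {{process::Future}}, not this stand-in. The point is only the shape of the signatures: a hook that must return labels directly cannot wait on an RPC or subprocess, while one returning a future can.

```cpp
#include <functional>
#include <map>
#include <string>

using Labels = std::map<std::string, std::string>;

// Toy stand-in for a future: a deferred computation. Real futures (e.g.,
// libprocess's process::Future) are richer and can carry failures.
template <typename T>
struct ToyFuture
{
  std::function<T()> get;
};

// Blocking decorator shape (like slaveRunTaskLabelDecorator): the result
// must be produced before returning, so the hook cannot wait on a
// subprocess or RPC without stalling the caller.
Labels blockingDecorator(const Labels& labels)
{
  Labels result = labels;
  result["secret-ref"] = "resolved-inline";  // Hypothetical label.
  return result;
}

// Non-blocking shape: returning a future lets the caller continue while an
// external call completes, and a failed future could abort the launch
// instead of the error being silently swallowed.
ToyFuture<Labels> nonBlockingDecorator(const Labels& labels)
{
  return ToyFuture<Labels>{[labels]() {
    Labels result = labels;
    result["secret-ref"] = "resolved-async";  // Stand-in for an RPC result.
    return result;
  }};
}
```

The agent would continue launching only once the returned future is satisfied, which is the behavior the ticket asks for.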
[jira] [Commented] (MESOS-5482) mesos/marathon task stuck in staging after slave reboot
[ https://issues.apache.org/jira/browse/MESOS-5482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121944#comment-16121944 ] Mao Geng commented on MESOS-5482: - Hit this issue on Mesos 1.2.0 and Marathon 1.4.3 too. The agent received no pings from the master for 75secs, then reconnected:
{quote}
I0810 13:18:43.142431 18394 slave.cpp:4378] No pings from master received within 75secs
I0810 13:18:43.142588 18393 slave.cpp:920] Re-detecting master
I0810 13:18:43.142614 18393 slave.cpp:966] Detecting new master
I0810 13:18:43.142674 18407 status_update_manager.cpp:177] Pausing sending status updates
I0810 13:18:43.142755 18420 status_update_manager.cpp:177] Pausing sending status updates
I0810 13:18:43.142813 18415 slave.cpp:931] New master detected at master@10.1.36.4:5050
I0810 13:18:43.142840 18415 slave.cpp:955] No credentials provided. Attempting to register without authentication
I0810 13:18:43.142853 18415 slave.cpp:966] Detecting new master
I0810 13:18:44.431833 18415 slave.cpp:1242] Re-registered with master master@10.1.36.4:5050
I0810 13:18:44.431874 18415 slave.cpp:1279] Forwarding total oversubscribed resources {}
I0810 13:18:44.431895 18398 status_update_manager.cpp:184] Resuming sending status updates
I0810 13:18:44.433912 18386 slave.cpp:2683] Shutting down framework f853458f-b07b-4b79-8192-24953f474369-
I0810 13:18:44.433939 18386 slave.cpp:5083] Shutting down executor 'metrics_statsd.2e578bc8-7bac-11e7-9ea1-0242c1e4f2c5' of framework f853458f-b07b-4b79-8192-24953f474369- at executor(1)@10.1.98.251:33041
W0810 13:18:44.435637 18440 slave.cpp:2823] Ignoring updating pid for framework f853458f-b07b-4b79-8192-24953f474369- because it is terminating
I0810 13:18:46.878993 18408 slave.cpp:1625] Got assigned task 'metrics_statsd.70dff634-7dce-11e7-bea2-0242f4eb80ac' for framework f853458f-b07b-4b79-8192-24953f474369-
I0810 13:18:46.879406 18408 slave.cpp:1785] Launching task 'metrics_statsd.70dff634-7dce-11e7-bea2-0242f4eb80ac' for framework f853458f-b07b-4b79-8192-24953f474369-
W0810 13:18:46.879436 18408 slave.cpp:1853] Ignoring running task 'metrics_statsd.70dff634-7dce-11e7-bea2-0242f4eb80ac' of framework f853458f-b07b-4b79-8192-24953f474369- because the framework is terminating
I0810 13:18:47.613224 18415 slave.cpp:3816] Handling status update TASK_KILLED (UUID: af78fc5c-8552-4aee-abae-cda3d0ec2909) for task metrics_statsd.2e578bc8-7bac-11e7-9ea1-0242c1e4f2c5 of framework f853458f-b07b-4b79-8192-24953f474369- from executor(1)@10.1.98.251:33041
W0810 13:18:47.613261 18415 slave.cpp:3885] Ignoring status update TASK_KILLED (UUID: af78fc5c-8552-4aee-abae-cda3d0ec2909) for task metrics_statsd.2e578bc8-7bac-11e7-9ea1-0242c1e4f2c5 of framework f853458f-b07b-4b79-8192-24953f474369- for terminating framework f853458f-b07b-4b79-8192-24953f474369-
I0810 13:18:48.618629 18409 slave.cpp:4388] Got exited event for executor(1)@10.1.98.251:33041
I0810 13:18:48.713826 18390 docker.cpp:2358] Executor for container 1f351db2-1011-4244-83c2-1854c44d7b65 has exited
I0810 13:18:48.713850 18390 docker.cpp:2052] Destroying container 1f351db2-1011-4244-83c2-1854c44d7b65
I0810 13:18:48.713892 18390 docker.cpp:2179] Running docker stop on container 1f351db2-1011-4244-83c2-1854c44d7b65
I0810 13:18:48.714363 18411 slave.cpp:4769] Executor 'metrics_statsd.2e578bc8-7bac-11e7-9ea1-0242c1e4f2c5' of framework f853458f-b07b-4b79-8192-24953f474369- exited with status 0
I0810 13:18:48.714390 18411 slave.cpp:4869] Cleaning up executor 'metrics_statsd.2e578bc8-7bac-11e7-9ea1-0242c1e4f2c5' of framework f853458f-b07b-4b79-8192-24953f474369- at executor(1)@10.1.98.251:33041
I0810 13:18:48.714589 18411 slave.cpp:4957] Cleaning up framework f853458f-b07b-4b79-8192-24953f474369-
I0810 13:18:48.714607 18432 gc.cpp:55] Scheduling '/mnt/mesos/slaves/508bde0b-4661-4a29-b674-32163345096f-S229/frameworks/f853458f-b07b-4b79-8192-24953f474369-/executors/metrics_statsd.2e578bc8-7bac-11e7-9ea1-0242c1e4f2c5/runs/1f351db2-1011-4244-83c2-1854c44d7b65' for gc 6.9173026667days in the future
I0810 13:18:48.714669 18410 status_update_manager.cpp:285] Closing status update streams for framework f853458f-b07b-4b79-8192-24953f474369-
I0810 13:18:48.714679 18432 gc.cpp:55] Scheduling '/mnt/mesos/slaves/508bde0b-4661-4a29-b674-32163345096f-S229/frameworks/f853458f-b07b-4b79-8192-24953f474369-/executors/metrics_statsd.2e578bc8-7bac-11e7-9ea1-0242c1e4f2c5' for gc 6.9172979259days in the future
I0810 13:18:48.714709 18432 gc.cpp:55] Scheduling '/mnt/mesos/meta/slaves/508bde0b-4661-4a29-b674-32163345096f-S229/frameworks/f853458f-b07b-4b79-8192-24953f474369-/executors/metrics_statsd.2e578bc8-7bac-11e7-9ea1-0242c1e4f2c5/runs/1f351db2-1011-4244-83c2-1854c44d7b65' for gc 6.9172953778days in the future
I0810 13:18:48.714725 18432 gc.cpp:55] Scheduling
[jira] [Commented] (MESOS-6390) Ensure Python support scripts are linted
[ https://issues.apache.org/jira/browse/MESOS-6390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121894#comment-16121894 ] Joseph Wu commented on MESOS-6390: --
{code}
commit d04ab2096169513561d20a414c67ed1aaed0ecd7
Author: Armand Grillet
Date:   Thu Aug 10 09:38:43 2017 -0700

    Linted support/test-upgrade.py.

    This will allow us to use PyLint on the entire support directory in the future.

    Review: https://reviews.apache.org/r/60235/
{code}
> Ensure Python support scripts are linted > > > Key: MESOS-6390 > URL: https://issues.apache.org/jira/browse/MESOS-6390 > Project: Mesos > Issue Type: Improvement >Reporter: Benjamin Bannier >Assignee: Armand Grillet > Labels: newbie, python > > Currently {{support/mesos-style.py}} does not lint files under {{support/}}. > This is mostly because these scripts are so inconsistent > style-wise that they wouldn't even pass the linter now. > We should clean up all Python scripts under {{support/}} so they pass the > Python linter, and activate that directory in the linter for future > additions. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-7878) Add default value for http_framework_authenticators flag
Zhitao Li created MESOS-7878: Summary: Add default value for http_framework_authenticators flag Key: MESOS-7878 URL: https://issues.apache.org/jira/browse/MESOS-7878 Project: Mesos Issue Type: Improvement Reporter: Zhitao Li Priority: Minor Based on http://mesos.apache.org/documentation/latest/configuration/, {{http_authenticator}} has a default value of {{basic}} but {{http_framework_authenticators}} does not have one. Given that people running the default Mesos distribution only have {{basic}} available, I feel that we should add a default value to this flag to avoid surprising operators when they turn on HTTP frameworks. Proposing Greg as shepherd. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7877) Audit test code for undefined behavior in accessing container elements
[ https://issues.apache.org/jira/browse/MESOS-7877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-7877: --- Labels: mesosphere newbie tech-debt test (was: mesosphere newbie tech-debt) > Audit test code for undefined behavior in accessing container elements > -- > > Key: MESOS-7877 > URL: https://issues.apache.org/jira/browse/MESOS-7877 > Project: Mesos > Issue Type: Bug > Components: test >Reporter: Benjamin Bannier > Labels: mesosphere, newbie, tech-debt, test > > We do not always make sure we never access elements from empty containers, > e.g., we use patterns like the following > {code} > Future<vector<Offer>> offers; > // Satisfy offers. > EXPECT_FALSE(offers->empty()); > const auto& offer = (*offers)[0]; > {code} > While the intention here is to diagnose an empty {{offers}}, the code still > exhibits undefined behavior in the element access if {{offers}} was indeed > empty (compilers might aggressively exploit undefined behavior to, e.g., > remove "impossible" code). Instead one should prevent accessing any elements > of an empty container, e.g., > {code} > ASSERT_FALSE(offers->empty()); // Prevent execution of the rest of the test body. > {code} > We should audit and fix existing test code for such incorrect checks and > variations involving, e.g., {{EXPECT_NE}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-7877) Audit test code for undefined behavior in accessing container elements
Benjamin Bannier created MESOS-7877: --- Summary: Audit test code for undefined behavior in accessing container elements Key: MESOS-7877 URL: https://issues.apache.org/jira/browse/MESOS-7877 Project: Mesos Issue Type: Bug Components: test Reporter: Benjamin Bannier We do not always make sure we never access elements from empty containers, e.g., we use patterns like the following
{code}
Future<vector<Offer>> offers;

// Satisfy offers.

EXPECT_FALSE(offers->empty());
const auto& offer = (*offers)[0];
{code}
While the intention here is to diagnose an empty {{offers}}, the code still exhibits undefined behavior in the element access if {{offers}} was indeed empty (compilers might aggressively exploit undefined behavior to, e.g., remove "impossible" code). Instead one should prevent accessing any elements of an empty container, e.g.,
{code}
ASSERT_FALSE(offers->empty()); // Prevent execution of the rest of the test body.
{code}
We should audit and fix existing test code for such incorrect checks and variations involving, e.g., {{EXPECT_NE}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
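The guard-before-access pattern the ticket recommends can be illustrated outside of gtest with a minimal sketch; {{firstOrDefault}} is a hypothetical helper, not Mesos code. The key is that the emptiness check gates the access itself, which is what makes {{ASSERT_*}} (which aborts the rest of the test body) safe where {{EXPECT_*}} (which merely records a failure and continues) is not:

```cpp
#include <vector>

// Hypothetical helper showing the guard-before-access pattern. Indexing an
// empty std::vector with operator[] is undefined behavior, so the emptiness
// check must prevent the access from ever executing, the way a gtest
// ASSERT_* aborts the remainder of a test body.
int firstOrDefault(const std::vector<int>& v, int fallback)
{
  if (v.empty()) {
    return fallback;  // Bail out; nothing below may touch v[0].
  }
  return v[0];  // Safe: the empty case was ruled out above.
}
```

In a test, the analogous fix is exactly what the ticket suggests: {{ASSERT_FALSE(offers->empty());}} stops execution before {{(*offers)[0]}} can run.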
[jira] [Updated] (MESOS-7872) Scheduler hang when registration fails (due to bad role)
[ https://issues.apache.org/jira/browse/MESOS-7872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-7872: --- Labels: framework reliability scheduler (was: framework scheduler) Component/s: scheduler driver > Scheduler hang when registration fails (due to bad role) > > > Key: MESOS-7872 > URL: https://issues.apache.org/jira/browse/MESOS-7872 > Project: Mesos > Issue Type: Bug > Components: scheduler driver >Affects Versions: 1.4.0 >Reporter: Till Toenshoff > Labels: framework, reliability, scheduler > > I'm finding that if framework registration fails, the mesos driver client > will hang indefinitely with the following output: > {noformat} > I0809 20:04:22.479391 73 sched.cpp:1187] Got error ''FrameworkInfo.role' > is not a valid role: Role '/test/role/slashes' cannot start with a slash' > I0809 20:04:22.479658 73 sched.cpp:2055] Asked to abort the driver > I0809 20:04:22.479843 73 sched.cpp:1233] Aborting framework > {noformat} > I'd have expected one or both of the following: > - SchedulerDriver.run() should have exited with a failed Proto.Status of some > form > - Scheduler.error() should have been invoked when the "Got error" occurred > Steps to reproduce: > - Launch a scheduler instance, have it register with a known-bad framework > info. In this case a role containing slashes was used > - Observe that the scheduler continues in a TASK_RUNNING state despite the > failed registration. From all appearances it looks like the Scheduler > implementation isn't invoked at all > I'd guess that because this failure happens before framework registration, > there's some error handling that isn't fully initialized at this point. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7872) Scheduler hang when registration fails (due to bad role)
[ https://issues.apache.org/jira/browse/MESOS-7872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121387#comment-16121387 ] Alexander Rukletsov commented on MESOS-7872: I've tried to reproduce this issue using a slightly modified {{no-executor-framework}}. Here is the output I get:
{noformat}
alex@alexr: ~/Projects/mesos/build/default $ ./src/no-executor-framework --master=127.0.0.1:5050
I0810 11:55:46.766144 1993596928 sched.cpp:232] Version: 1.4.0
I0810 11:55:46.766348 1993596928 sched.cpp:2090] Awaiting latch
I0810 11:55:46.771299 3211264 sched.cpp:336] New master detected at master@127.0.0.1:5050
I0810 11:55:46.774588 3211264 sched.cpp:352] No credentials provided. Attempting to register without authentication
I0810 11:55:46.792697 2674688 sched.cpp:1187] Got error ''FrameworkInfo.role' is not a valid role: Role '/test/rt' cannot start with a slash'
I0810 11:55:46.792721 2674688 sched.cpp:2055] Asked to abort the driver
E0810 11:55:46.792738 2674688 no_executor_framework.cpp:216] 'FrameworkInfo.role' is not a valid role: Role '/test/rt' cannot start with a slash
I0810 11:55:46.792752 2674688 sched.cpp:1233] Aborting framework
E0810 11:55:46.792788 4820992 process.cpp:2584] Failed to shutdown socket with fd 9, address 192.168.1.113:56500: Socket is not connected
I0810 11:55:46.792866 1993596928 sched.cpp:2092] Latch is triggered
I0810 11:55:46.792881 1993596928 sched.cpp:2021] Asked to stop the driver
{noformat}
If I remove [{{driver->stop}}|https://github.com/apache/mesos/blob/2cea83653afcf6d7470242379809645bfe009016/src/examples/no_executor_framework.cpp#L398], the scheduler exits anyway:
{noformat}
alex@alexr: ~/Projects/mesos/build/default $ ./src/no-executor-framework --master=127.0.0.1:5050
I0810 12:00:46.115882 1993596928 sched.cpp:232] Version: 1.4.0
I0810 12:00:46.116058 1993596928 sched.cpp:2090] Awaiting latch
I0810 12:00:46.118584 2674688 sched.cpp:336] New master detected at master@127.0.0.1:5050
I0810 12:00:46.118834 2674688 sched.cpp:352] No credentials provided. Attempting to register without authentication
I0810 12:00:46.120816 4284416 sched.cpp:1187] Got error ''FrameworkInfo.role' is not a valid role: Role '/test/role' cannot start with a slash'
I0810 12:00:46.120842 4284416 sched.cpp:2055] Asked to abort the driver
E0810 12:00:46.120847 4820992 process.cpp:2584] Failed to shutdown socket with fd 9, address 192.168.1.113:57081: Socket is not connected
E0810 12:00:46.120869 4284416 no_executor_framework.cpp:216] 'FrameworkInfo.role' is not a valid role: Role '/test/role' cannot start with a slash
I0810 12:00:46.120895 4284416 sched.cpp:1233] Aborting framework
I0810 12:00:46.121004 1993596928 sched.cpp:2092] Latch is triggered
{noformat}
Can you share the code of your scheduler, especially the part where you create and wait for the driver? > Scheduler hang when registration fails (due to bad role) > > > Key: MESOS-7872 > URL: https://issues.apache.org/jira/browse/MESOS-7872 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.4.0 >Reporter: Till Toenshoff > Labels: framework, scheduler > > I'm finding that if framework registration fails, the mesos driver client > will hang indefinitely with the following output: > {noformat} > I0809 20:04:22.479391 73 sched.cpp:1187] Got error ''FrameworkInfo.role' > is not a valid role: Role '/test/role/slashes' cannot start with a slash' > I0809 20:04:22.479658 73 sched.cpp:2055] Asked to abort the driver > I0809 20:04:22.479843 73 sched.cpp:1233] Aborting framework > {noformat} > I'd have expected one or both of the following: > - SchedulerDriver.run() should have exited with a failed Proto.Status of some > form > - Scheduler.error() should have been invoked when the "Got error" occurred > Steps to reproduce: > - Launch a scheduler instance, have it register with a known-bad framework > info. In this case a role containing slashes was used > - Observe that the scheduler continues in a TASK_RUNNING state despite the > failed registration. 
From all appearances it looks like the Scheduler > implementation isn't invoked at all > I'd guess that because this failure happens before framework registration, > there's some error handling that isn't fully initialized at this point. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-6950) Launching two tasks with the same Docker image simultaneously may cause a staging dir never cleaned up
[ https://issues.apache.org/jira/browse/MESOS-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121352#comment-16121352 ] Qian Zhang commented on MESOS-6950: --- RR: https://reviews.apache.org/r/61546/ > Launching two tasks with the same Docker image simultaneously may cause a > staging dir never cleaned up > -- > > Key: MESOS-6950 > URL: https://issues.apache.org/jira/browse/MESOS-6950 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Qian Zhang >Assignee: Qian Zhang > > If a user launches two tasks with the same Docker image simultaneously (e.g., > runs {{mesos-executor}} twice with the same Docker image), the staging > directory for the second task will never be cleaned up, like this: > {code} > └── store > └── docker > ├── layers > │... > ├── staging > │ └── a6rXWC > └── storedImages > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-7876) Investigate jemalloc as a possible malloc for mesos
Benno Evers created MESOS-7876: -- Summary: Investigate jemalloc as a possible malloc for mesos Key: MESOS-7876 URL: https://issues.apache.org/jira/browse/MESOS-7876 Project: Mesos Issue Type: Improvement Reporter: Benno Evers Assignee: Benno Evers It is currently very hard to debug memory issues, in particular memory leaks, in Mesos. An alluring way to improve the situation would be to change the default malloc to jemalloc, which has built-in heap-profiling capabilities. However, some care needs to be taken when considering changing such a fundamental part of Mesos: * Would such a switch have any adverse impact on performance? * Is it available, and will it compile, on all our target platforms? * Is jemalloc's licensing compatible with bundling it as a third-party library? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-6950) Launching two tasks with the same Docker image simultaneously may cause a staging dir never cleaned up
[ https://issues.apache.org/jira/browse/MESOS-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qian Zhang updated MESOS-6950: -- Shepherd: Gilbert Song (was: Qian Zhang) > Launching two tasks with the same Docker image simultaneously may cause a > staging dir never cleaned up > -- > > Key: MESOS-6950 > URL: https://issues.apache.org/jira/browse/MESOS-6950 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Qian Zhang >Assignee: Qian Zhang > > If a user launches two tasks with the same Docker image simultaneously (e.g., > runs {{mesos-executor}} twice with the same Docker image), the staging > directory for the second task will never be cleaned up, like this: > {code} > └── store > └── docker > ├── layers > │... > ├── staging > │ └── a6rXWC > └── storedImages > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (MESOS-6950) Launching two tasks with the same Docker image simultaneously may cause a staging dir never cleaned up
[ https://issues.apache.org/jira/browse/MESOS-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qian Zhang reassigned MESOS-6950: - Assignee: Qian Zhang (was: Gilbert Song) > Launching two tasks with the same Docker image simultaneously may cause a > staging dir never cleaned up > -- > > Key: MESOS-6950 > URL: https://issues.apache.org/jira/browse/MESOS-6950 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Qian Zhang >Assignee: Qian Zhang > > If a user launches two tasks with the same Docker image simultaneously (e.g., > runs {{mesos-executor}} twice with the same Docker image), the staging > directory for the second task will never be cleaned up, like this: > {code} > └── store > └── docker > ├── layers > │... > ├── staging > │ └── a6rXWC > └── storedImages > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)