[jira] [Commented] (MESOS-3821) DOCKER_HOST does not work well with --executor_environment_variables

2017-08-10 Thread Huitse Tai (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122782#comment-16122782
 ] 

Huitse Tai commented on MESOS-3821:
---

Hi guys, will this issue be resolved or not?

> DOCKER_HOST does not work well with --executor_environment_variables
> 
>
> Key: MESOS-3821
> URL: https://issues.apache.org/jira/browse/MESOS-3821
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 0.25.0
> Environment: Docker 1.7.1
> Mesos 0.25.0
>Reporter: Lei Xu
>Assignee: haosdent
>
> Hi guys,
> I found that DOCKER_HOST has no effect when I set
> bq. --executor_environment_variables={"DOCKER_HOST":"localhost:2377"}
> because the docker executor always appends
> bq. -H unix:///var/run/docker.sock
> to each command, which in effect overrides DOCKER_HOST.
> I think this is too strict, and I could not find a command-line flag to
> disable it.
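
For illustration only, here is a minimal sketch of the behaviour the reporter seems to be asking for: pass -H only when DOCKER_HOST is not already set in the executor environment. The helper name docker_argv and its parameters are hypothetical and are not taken from the Mesos docker executor, which is written in C++; the sketch relies only on the fact that an explicit -H on the docker command line takes precedence over the DOCKER_HOST environment variable.

```python
def docker_argv(subcommand, env, default_socket="unix:///var/run/docker.sock"):
    """Build a docker CLI argument list (hypothetical helper).

    -H is added only when DOCKER_HOST is absent or empty, so a DOCKER_HOST
    supplied through the executor environment is not overridden.
    """
    argv = ["docker"]
    if not env.get("DOCKER_HOST"):
        argv += ["-H", default_socket]
    return argv + list(subcommand)


# DOCKER_HOST set: no -H is added, so the docker CLI uses the variable.
print(docker_argv(["ps"], {"DOCKER_HOST": "localhost:2377"}))
# DOCKER_HOST unset: fall back to the default socket.
print(docker_argv(["ps"], {}))
```

Whether the fallback should be the local socket or a new agent flag is a design question this sketch does not settle.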



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7880) Add an option to skip the Mesos style check when applying a review chain.

2017-08-10 Thread Chun-Hung Hsiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao updated MESOS-7880:
---
Sprint: Mesosphere Sprint 61

> Add an option to skip the Mesos style check when applying a review chain.
> -
>
> Key: MESOS-7880
> URL: https://issues.apache.org/jira/browse/MESOS-7880
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Minor
>
> The following pre-commit hook prevents us from committing a patch file for a
> third-party library that violates the Mesos style guide, which causes
> {{support/apply-reviews.py}} to fail.
> https://github.com/apache/mesos/blob/master/support/hooks/pre-commit#L24
> As a workaround, we could add a new option to skip the pre-commit hook.
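
One possible shape for such an option, as a rough sketch: git itself can bypass the pre-commit hook with `git commit --no-verify`, so a script only needs to forward a flag to it. The flag name --skip-hooks and both helper functions below are assumptions made for illustration; they are not existing apply-reviews.py options.

```python
import argparse


def build_commit_command(message, skip_hooks=False):
    """Return the git commit argv; --no-verify bypasses the pre-commit hook."""
    cmd = ["git", "commit", "-m", message]
    if skip_hooks:
        cmd.append("--no-verify")
    return cmd


def parse_args(argv):
    parser = argparse.ArgumentParser()
    # Hypothetical flag name, not an existing apply-reviews.py option.
    parser.add_argument("--skip-hooks", action="store_true",
                        help="do not run the pre-commit hook when committing")
    return parser.parse_args(argv)


options = parse_args(["--skip-hooks"])
print(build_commit_command("Applied review.", options.skip_hooks))
```

Note that --no-verify also bypasses the commit-msg hook, so a real option might need to be narrower than this.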





[jira] [Updated] (MESOS-7809) Building gRPC with Autotools

2017-08-10 Thread Chun-Hung Hsiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao updated MESOS-7809:
---
Labels: storage  (was: )

> Building gRPC with Autotools
> 
>
> Key: MESOS-7809
> URL: https://issues.apache.org/jira/browse/MESOS-7809
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>  Labels: storage
>
> gRPC does not come with an autotools script and has a hand-written Makefile
> that assumes certain libraries are pre-installed on the system. To support
> gRPC in autotools, we need to write proper rules in our autotools
> configuration that override the default path options in gRPC's Makefile.





[jira] [Updated] (MESOS-7870) Refactor libssl and libcrypto checks for building gRPC

2017-08-10 Thread Chun-Hung Hsiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao updated MESOS-7870:
---
Labels: storage  (was: )

> Refactor libssl and libcrypto checks for building gRPC
> --
>
> Key: MESOS-7870
> URL: https://issues.apache.org/jira/browse/MESOS-7870
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>  Labels: storage
> Fix For: 1.5.0
>
>
> Refactor the library checks for OpenSSL so that they are decoupled from the
> `--enable-ssl` flag, because of the dependency between OpenSSL and gRPC.





[jira] [Updated] (MESOS-7810) gRPC support in libprocess

2017-08-10 Thread Chun-Hung Hsiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao updated MESOS-7810:
---
Labels: storage  (was: )

> gRPC support in libprocess
> --
>
> Key: MESOS-7810
> URL: https://issues.apache.org/jira/browse/MESOS-7810
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>  Labels: storage
>
> We would like to introduce a gRPC wrapper in libprocess. The wrapper provides
> a clean interface for asynchronous gRPC calls and returns a {{Future}}, so
> others can easily use actor-based programming with libprocess to support gRPC
> communication.
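
The general pattern described here, turning a callback-style asynchronous call into one that returns a future, can be sketched as follows. This is not libprocess or gRPC code: it uses Python's concurrent.futures.Future as a stand-in for {{Future}}, and call_async is a made-up placeholder for an asynchronous RPC.

```python
from concurrent.futures import Future
import threading


def call_async(request, on_done):
    """Stand-in for a callback-style asynchronous RPC (hypothetical)."""
    threading.Thread(target=lambda: on_done(request.upper(), None)).start()


def call(request):
    """Wrap the callback-style call and return a Future for its result."""
    future = Future()

    def on_done(response, error):
        # Complete the future exactly once, with either a result or an error.
        if error is not None:
            future.set_exception(error)
        else:
            future.set_result(response)

    call_async(request, on_done)
    return future


print(call("ping").result(timeout=5))
```

The caller never sees the callback; it only chains on or waits for the returned future, which is the property the ticket asks of the wrapper.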





[jira] [Updated] (MESOS-7808) Bundling gRPC into 3rdparty

2017-08-10 Thread Chun-Hung Hsiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao updated MESOS-7808:
---
Labels: storage  (was: )

> Bundling gRPC into 3rdparty
> ---
>
> Key: MESOS-7808
> URL: https://issues.apache.org/jira/browse/MESOS-7808
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>  Labels: storage
> Fix For: 1.4.0
>
>
> gRPC comes with a hand-written Makefile and a CMake file, but no autotools
> configuration scripts. As a first step to support gRPC in Mesos, we could
> integrate gRPC into our CMake build process under Linux, and make it a
> dependency of libprocess. Since it also depends on protobuf, this will create
> a triangular dependency between protobuf, gRPC and libprocess, so the
> existing build configurations need to be adjusted as well.





[jira] [Created] (MESOS-7881) Building gRPC with CMake

2017-08-10 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-7881:
--

 Summary: Building gRPC with CMake
 Key: MESOS-7881
 URL: https://issues.apache.org/jira/browse/MESOS-7881
 Project: Mesos
  Issue Type: Improvement
Reporter: Chun-Hung Hsiao
Assignee: Chun-Hung Hsiao
 Fix For: 1.4.0


gRPC manages its own third-party libraries, which overlap with Mesos'
third-party library bundles. We need to write proper rules in CMake that
configure gRPC's own CMake build correctly.





[jira] [Updated] (MESOS-7808) Bundling gRPC into 3rdparty

2017-08-10 Thread Chun-Hung Hsiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao updated MESOS-7808:
---
Summary: Bundling gRPC into 3rdparty  (was: Bundling gRPC into 3rdparty 
with CMake under Linux)

> Bundling gRPC into 3rdparty
> ---
>
> Key: MESOS-7808
> URL: https://issues.apache.org/jira/browse/MESOS-7808
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>
> gRPC comes with a hand-written Makefile and a CMake file, but no autotools
> configuration scripts. As a first step to support gRPC in Mesos, we could
> integrate gRPC into our CMake build process under Linux, and make it a
> dependency of libprocess. Since it also depends on protobuf, this will create
> a triangular dependency between protobuf, gRPC and libprocess, so the
> existing build configurations need to be adjusted as well.





[jira] [Issue Comment Deleted] (MESOS-7869) Build fails with `--disable-zlib` or `--with-zlib=DIR`

2017-08-10 Thread Chun-Hung Hsiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao updated MESOS-7869:
---
Comment: was deleted

(was: {noformat}
commit 30914ea9445e2ec3eb48e2daad814accca8f404c
Author: Chun-Hung Hsiao 
Date:   Tue Aug 8 13:39:33 2017 -0700

Removed `--disable-zlib` and fixed `--with-zlib` for Mesos.

For third-party libraries that does not support `--with-zlib=DIR`, we
introduce new variables `ZLIB_CPPFLAGS` and `ZLIB_LINKERFLAGS` so that
they can be used to set up `CPPFLAGS` and `LDFLAGS` when building those
libraries.

Review: https://reviews.apache.org/r/61508
{noformat}
{noformat}
commit 7a385a464fe1b76bbd7b3009d8f043fbe0eff6f9
Author: Chun-Hung Hsiao 
Date:   Tue Aug 8 13:44:11 2017 -0700

Removed `--disable-zlib` and fixed `--with-zlib` for libprocess.

Added `--with-zlib` for specifying a custom zlib path. For third-party
libraries that does not support `--with-zlib=DIR`, we introduce new
variables `ZLIB_CPPFLAGS` and `ZLIB_LINKERFLAGS` so that they can be
used to set up `CPPFLAGS` and `LDFLAGS` when building those libraries.

Review: https://reviews.apache.org/r/61509
{noformat})

> Build fails with `--disable-zlib` or `--with-zlib=DIR`
> --
>
> Key: MESOS-7869
> URL: https://issues.apache.org/jira/browse/MESOS-7869
> Project: Mesos
>  Issue Type: Bug
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
> Fix For: 1.4.0
>
>
> zlib is a required library for Mesos and libprocess, so {{--disable-zlib}}
> no longer works and should be removed.
> Also, when {{--with-zlib=DIR}} is specified, the protobuf build fails
> because it does not support specifying a custom zlib path through
> {{--with-zlib}}.





[jira] [Updated] (MESOS-7869) Build fails with `--disable-zlib` or `--with-zlib=DIR`

2017-08-10 Thread Chun-Hung Hsiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao updated MESOS-7869:
---

{noformat}
commit 30914ea9445e2ec3eb48e2daad814accca8f404c
Author: Chun-Hung Hsiao 
Date:   Tue Aug 8 13:39:33 2017 -0700

Removed `--disable-zlib` and fixed `--with-zlib` for Mesos.

For third-party libraries that does not support `--with-zlib=DIR`, we
introduce new variables `ZLIB_CPPFLAGS` and `ZLIB_LINKERFLAGS` so that
they can be used to set up `CPPFLAGS` and `LDFLAGS` when building those
libraries.

Review: https://reviews.apache.org/r/61508
{noformat}
{noformat}
commit 7a385a464fe1b76bbd7b3009d8f043fbe0eff6f9
Author: Chun-Hung Hsiao 
Date:   Tue Aug 8 13:44:11 2017 -0700

Removed `--disable-zlib` and fixed `--with-zlib` for libprocess.

Added `--with-zlib` for specifying a custom zlib path. For third-party
libraries that does not support `--with-zlib=DIR`, we introduce new
variables `ZLIB_CPPFLAGS` and `ZLIB_LINKERFLAGS` so that they can be
used to set up `CPPFLAGS` and `LDFLAGS` when building those libraries.

Review: https://reviews.apache.org/r/61509
{noformat}

> Build fails with `--disable-zlib` or `--with-zlib=DIR`
> --
>
> Key: MESOS-7869
> URL: https://issues.apache.org/jira/browse/MESOS-7869
> Project: Mesos
>  Issue Type: Bug
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
> Fix For: 1.4.0
>
>
> zlib is a required library for Mesos and libprocess, so {{--disable-zlib}}
> no longer works and should be removed.
> Also, when {{--with-zlib=DIR}} is specified, the protobuf build fails
> because it does not support specifying a custom zlib path through
> {{--with-zlib}}.





[jira] [Updated] (MESOS-7880) Add an option to skip the Mesos style check when applying a review chain.

2017-08-10 Thread Chun-Hung Hsiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao updated MESOS-7880:
---
Story Points: 1

> Add an option to skip the Mesos style check when applying a review chain.
> -
>
> Key: MESOS-7880
> URL: https://issues.apache.org/jira/browse/MESOS-7880
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Minor
>
> The following pre-commit hook prevents us from committing a patch file for a
> third-party library that violates the Mesos style guide, which causes
> {{support/apply-reviews.py}} to fail.
> https://github.com/apache/mesos/blob/master/support/hooks/pre-commit#L24
> As a workaround, we could add a new option to skip the pre-commit hook.





[jira] [Created] (MESOS-7880) Add an option to skip the Mesos style check when applying a review chain.

2017-08-10 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-7880:
--

 Summary: Add an option to skip the Mesos style check when applying 
a review chain.
 Key: MESOS-7880
 URL: https://issues.apache.org/jira/browse/MESOS-7880
 Project: Mesos
  Issue Type: Improvement
Reporter: Chun-Hung Hsiao
Assignee: Chun-Hung Hsiao
Priority: Minor


The following pre-commit hook prevents us from committing a patch file for a
third-party library that violates the Mesos style guide, which causes
{{support/apply-reviews.py}} to fail.
https://github.com/apache/mesos/blob/master/support/hooks/pre-commit#L24
As a workaround, we could add a new option to skip the pre-commit hook.





[jira] [Updated] (MESOS-6743) Docker executor hangs forever if `docker stop` fails.

2017-08-10 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6743:
---
Fix Version/s: 1.1.3

{noformat}
Commit: 4d2afc50c88afff1c197720fa507637def4d2f20 [4d2afc5]
Author: Andrei Budnik abud...@mesosphere.com
Date: 10 August 2017 at 18:52:51 GMT+2
Committer: Alexander Rukletsov al...@apache.org
Commit Date: 10 August 2017 at 22:46:35 GMT+2

Added logging in docker executor on docker stop failure.

Review: https://reviews.apache.org/r/61435/
{noformat}
{noformat}
Commit: 06dcbd7b7c876a1f90934a679e2514d012df4d37 [06dcbd7]
Author: Andrei Budnik abud...@mesosphere.com
Date: 10 August 2017 at 18:53:03 GMT+2
Committer: Alexander Rukletsov al...@apache.org
Commit Date: 10 August 2017 at 22:46:35 GMT+2

Enabled retries for killTask in docker executor.

Previously, after docker stop command failure, docker executor
neither allowed a scheduler to retry killTask command, nor retried
killTask when task kill was triggered by a failed health check.

Review: https://reviews.apache.org/r/61530/
{noformat}

> Docker executor hangs forever if `docker stop` fails.
> -
>
> Key: MESOS-6743
> URL: https://issues.apache.org/jira/browse/MESOS-6743
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.0.1, 1.1.0, 1.2.1, 1.3.0
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>Priority: Critical
>  Labels: mesosphere, reliability
> Fix For: 1.1.3
>
>
> If {{docker stop}} finishes with an error status, the executor should catch 
> this and react instead of indefinitely waiting for {{reaped}} to return.
> An interesting question is _how_ to react. Here are possible solutions.
> 1. Retry {{docker stop}}. In this case it is unclear how many times to retry 
> and what to do if {{docker stop}} continues to fail.
> 2. Unmark task as {{killed}}. This will allow frameworks to retry the kill. 
> However, in this case it is unclear what status updates we should send: 
> {{TASK_KILLING}} for every kill retry? an extra update when we failed to kill 
> a task? or set a specific reason in {{TASK_KILLING}}?
> 3. Clean up and exit. In this case we should make sure the task container is 
> killed or notify the framework and the operator that the container may still 
> be running.
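
To make option 1 concrete, here is a minimal sketch of a bounded retry around {{docker stop}}. The names stop_with_retries and docker_stop, the default of three attempts, and the fixed delay are all assumptions chosen for illustration; the actual fix is in the C++ docker executor and may look quite different.

```python
import subprocess
import time


def stop_with_retries(run_stop, attempts=3, delay=0.0):
    """Call run_stop() until it returns exit status 0 or attempts run out.

    Returns True on success. False means the caller still has to decide what
    to do, for example report that the container may still be running.
    """
    for attempt in range(attempts):
        if run_stop() == 0:
            return True
        if attempt + 1 < attempts:
            time.sleep(delay)
    return False


def docker_stop(container):
    """Run `docker stop <container>` and return its exit status."""
    return subprocess.run(["docker", "stop", container]).returncode
```

A caller would use it as `stop_with_retries(lambda: docker_stop("my-container"))`. The sketch deliberately leaves open the question raised in option 1: what to do when every attempt fails.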





[jira] [Comment Edited] (MESOS-4969) improve overlayfs detection

2017-08-10 Thread Aaron Wood (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122375#comment-16122375
 ] 

Aaron Wood edited comment on MESOS-4969 at 8/10/17 9:34 PM:


Sorry, I thought Mesos was only looking at {{/proc/modules}}.


was (Author: aaron.wood):
Sorry, I thought Mesos was only looking at {noformat}/proc/modules{noformat}.

> improve overlayfs detection
> ---
>
> Key: MESOS-4969
> URL: https://issues.apache.org/jira/browse/MESOS-4969
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, storage
>Reporter: James Peach
>Priority: Minor
>
> On my Fedora 23, overlayfs is a module that is not loaded by default 
> (attempting to mount an overlayfs automatically triggers the module loading). 
> However {{mesos-slave}} won't start until I manually load the module since it 
> is not listed in {{/proc/filesystems}} until it is loaded.
> It would be nice if there was a more reliable way to determine overlayfs 
> support.
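
The kind of check described above can be sketched as follows, which also shows why it misses a module that has not been loaded yet. The function name is hypothetical, and the filesystem name "overlay" is an assumption (some older kernels register it as "overlayfs"); this is not the Mesos implementation.

```python
def supports_filesystem(proc_filesystems_text, name):
    """Return True if `name` is listed in /proc/filesystems-style text.

    Each line holds an optional "nodev" flag followed by the filesystem name,
    so the name is always the last whitespace-separated field.
    """
    for line in proc_filesystems_text.splitlines():
        fields = line.split()
        if fields and fields[-1] == name:
            return True
    return False


# Sample contents before the overlay module has been loaded.
sample = "nodev\tsysfs\nnodev\tproc\n\text4\n"
print(supports_filesystem(sample, "overlay"))
```

Because the kernel only lists a filesystem once its module is registered, a listing-based check reports no support on a system that would load the module on the first mount attempt.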





[jira] [Commented] (MESOS-4969) improve overlayfs detection

2017-08-10 Thread Aaron Wood (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122375#comment-16122375
 ] 

Aaron Wood commented on MESOS-4969:
---

Sorry, I thought Mesos was only looking at {noformat}/proc/modules{noformat}.

> improve overlayfs detection
> ---
>
> Key: MESOS-4969
> URL: https://issues.apache.org/jira/browse/MESOS-4969
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, storage
>Reporter: James Peach
>Priority: Minor
>
> On my Fedora 23, overlayfs is a module that is not loaded by default 
> (attempting to mount an overlayfs automatically triggers the module loading). 
> However {{mesos-slave}} won't start until I manually load the module since it 
> is not listed in {{/proc/filesystems}} until it is loaded.
> It would be nice if there was a more reliable way to determine overlayfs 
> support.





[jira] [Commented] (MESOS-1719) Master should persist active frameworks information

2017-08-10 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122328#comment-16122328
 ] 

Yan Xu commented on MESOS-1719:
---

[~adam-mesos] does this being labelled {{mesosphere}} mean this is on your 
roadmap in the near to medium term?

> Master should persist active frameworks information
> ---
>
> Key: MESOS-1719
> URL: https://issues.apache.org/jira/browse/MESOS-1719
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Vinod Kone
>Assignee: Yongqiao Wang
>  Labels: mesosphere, reliability
>
> https://issues.apache.org/jira/browse/MESOS-1219 disallows completed 
> frameworks from re-registering with the same framework id, as long as the 
> master doesn't failover.
> This ticket tracks the work to make this work across master failover using
> the registrar.
> There are some open questions that need to be addressed:
> --> Should the registry contain framework ids only, or framework infos?
> For disallowing completed frameworks from re-registering, persisting
> framework ids is enough. But if, in the future, we want to disallow
> frameworks from re-registering when some parts of the framework info
> have changed, then we need to persist the info too.
> --> How to update the framework info?
> Currently frameworks are allowed to update framework info while
> re-registering, but it only takes effect on the master when the master
> fails over and on the slave when the slave fails over. How should things
> change when we persist framework info?





[jira] [Updated] (MESOS-7872) Scheduler hang when registration fails (due to bad role)

2017-08-10 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7872:
---
  Sprint: Mesosphere Sprint 61
Story Points: 3

> Scheduler hang when registration fails (due to bad role)
> 
>
> Key: MESOS-7872
> URL: https://issues.apache.org/jira/browse/MESOS-7872
> Project: Mesos
>  Issue Type: Bug
>  Components: scheduler driver
>Affects Versions: 1.4.0
>Reporter: Till Toenshoff
>Assignee: Alexander Rukletsov
>  Labels: framework, reliability, scheduler
>
> I'm finding that if framework registration fails, the mesos driver client 
> will hang indefinitely with the following output:
> {noformat}
> I0809 20:04:22.479391    73 sched.cpp:1187] Got error ''FrameworkInfo.role' 
> is not a valid role: Role '/test/role/slashes' cannot start with a slash'
> I0809 20:04:22.479658    73 sched.cpp:2055] Asked to abort the driver
> I0809 20:04:22.479843    73 sched.cpp:1233] Aborting framework 
> {noformat}
> I'd have expected one or both of the following:
> - SchedulerDriver.run() should have exited with a failed Proto.Status of some 
> form
> - Scheduler.error() should have been invoked when the "Got error" occurred
> Steps to reproduce:
> - Launch a scheduler instance, have it register with a known-bad framework 
> info. In this case a role containing slashes was used
> - Observe that the scheduler continues in a TASK_RUNNING state despite the 
> failed registration. From all appearances it looks like the Scheduler 
> implementation isn't invoked at all
> I'd guess that because this failure happens before framework registration, 
> there's some error handling that isn't fully initialized at this point.





[jira] [Updated] (MESOS-7872) Scheduler hang when registration fails (due to bad role)

2017-08-10 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7872:
---
Shepherd: Anand Mazumdar

> Scheduler hang when registration fails (due to bad role)
> 
>
> Key: MESOS-7872
> URL: https://issues.apache.org/jira/browse/MESOS-7872
> Project: Mesos
>  Issue Type: Bug
>  Components: scheduler driver
>Affects Versions: 1.4.0
>Reporter: Till Toenshoff
>Assignee: Alexander Rukletsov
>  Labels: framework, reliability, scheduler
>
> I'm finding that if framework registration fails, the mesos driver client 
> will hang indefinitely with the following output:
> {noformat}
> I0809 20:04:22.479391    73 sched.cpp:1187] Got error ''FrameworkInfo.role' 
> is not a valid role: Role '/test/role/slashes' cannot start with a slash'
> I0809 20:04:22.479658    73 sched.cpp:2055] Asked to abort the driver
> I0809 20:04:22.479843    73 sched.cpp:1233] Aborting framework 
> {noformat}
> I'd have expected one or both of the following:
> - SchedulerDriver.run() should have exited with a failed Proto.Status of some 
> form
> - Scheduler.error() should have been invoked when the "Got error" occurred
> Steps to reproduce:
> - Launch a scheduler instance, have it register with a known-bad framework 
> info. In this case a role containing slashes was used
> - Observe that the scheduler continues in a TASK_RUNNING state despite the 
> failed registration. From all appearances it looks like the Scheduler 
> implementation isn't invoked at all
> I'd guess that because this failure happens before framework registration, 
> there's some error handling that isn't fully initialized at this point.





[jira] [Assigned] (MESOS-7872) Scheduler hang when registration fails (due to bad role)

2017-08-10 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-7872:
--

Assignee: Alexander Rukletsov

> Scheduler hang when registration fails (due to bad role)
> 
>
> Key: MESOS-7872
> URL: https://issues.apache.org/jira/browse/MESOS-7872
> Project: Mesos
>  Issue Type: Bug
>  Components: scheduler driver
>Affects Versions: 1.4.0
>Reporter: Till Toenshoff
>Assignee: Alexander Rukletsov
>  Labels: framework, reliability, scheduler
>
> I'm finding that if framework registration fails, the mesos driver client 
> will hang indefinitely with the following output:
> {noformat}
> I0809 20:04:22.479391    73 sched.cpp:1187] Got error ''FrameworkInfo.role' 
> is not a valid role: Role '/test/role/slashes' cannot start with a slash'
> I0809 20:04:22.479658    73 sched.cpp:2055] Asked to abort the driver
> I0809 20:04:22.479843    73 sched.cpp:1233] Aborting framework 
> {noformat}
> I'd have expected one or both of the following:
> - SchedulerDriver.run() should have exited with a failed Proto.Status of some 
> form
> - Scheduler.error() should have been invoked when the "Got error" occurred
> Steps to reproduce:
> - Launch a scheduler instance, have it register with a known-bad framework 
> info. In this case a role containing slashes was used
> - Observe that the scheduler continues in a TASK_RUNNING state despite the 
> failed registration. From all appearances it looks like the Scheduler 
> implementation isn't invoked at all
> I'd guess that because this failure happens before framework registration, 
> there's some error handling that isn't fully initialized at this point.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7879) The kill nested container call should provide ability to specify a signal.

2017-08-10 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-7879:
-

 Summary: The kill nested container call should provide ability to 
specify a signal.
 Key: MESOS-7879
 URL: https://issues.apache.org/jira/browse/MESOS-7879
 Project: Mesos
  Issue Type: Task
Reporter: Anand Mazumdar
Assignee: Anand Mazumdar


Currently, the {{KILL_NESTED_CONTAINER}} call only sends the SIGKILL signal to a
running container. We should make the signal configurable and have the default
executor specify it, i.e., initially send SIGTERM, followed by SIGKILL.
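
As an illustration of the escalation this ticket appears to have in mind (assuming the intended sequence is SIGTERM first and SIGKILL only if the process does not exit), here is a small sketch using an ordinary child process. It does not use the Mesos agent API; terminate_then_kill and the grace period are hypothetical.

```python
import subprocess
import sys


def terminate_then_kill(proc, grace_seconds):
    """Send SIGTERM, wait up to grace_seconds, then send SIGKILL if needed.

    Returns the exit status of the process.
    """
    proc.terminate()  # SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        return proc.wait()


# A child that would otherwise sleep for a minute.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
status = terminate_then_kill(child, grace_seconds=5)
print(status)
```

On POSIX systems a negative status -N means the child was ended by signal N, which shows whether the graceful signal was enough.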





[jira] [Issue Comment Deleted] (MESOS-7872) Scheduler hang when registration fails (due to bad role)

2017-08-10 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7872:
---
Comment: was deleted

(was: The problem is likely in the HTTP adapter. [Java side of the 
adapter|https://github.com/mesosphere/mesos-http-adapter/blob/master/src/main/java/com/mesosphere/mesos/http-adapter/MesosToSchedulerDriverAdapter.java]
 sends a {{SUBSCRIBE}} request that never completes, due to an error. That 
error is transferred to the [C++ side of the 
adapter|https://github.com/apache/mesos/blob/364abfc1bed8543b984ebd3712047b5ed8a109d2/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp#L550],
 but is not transmitted to the Java side, because {{SUBSCRIBED}} [has not 
succeeded|https://github.com/apache/mesos/blob/364abfc1bed8543b984ebd3712047b5ed8a109d2/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp#L699]
 yet! Deadlock.

A fix here would be allowing {{ERROR}} events to go through even if the 
scheduler has not subscribed yet.)

> Scheduler hang when registration fails (due to bad role)
> 
>
> Key: MESOS-7872
> URL: https://issues.apache.org/jira/browse/MESOS-7872
> Project: Mesos
>  Issue Type: Bug
>  Components: scheduler driver
>Affects Versions: 1.4.0
>Reporter: Till Toenshoff
>  Labels: framework, reliability, scheduler
>
> I'm finding that if framework registration fails, the mesos driver client 
> will hang indefinitely with the following output:
> {noformat}
> I0809 20:04:22.479391    73 sched.cpp:1187] Got error ''FrameworkInfo.role' 
> is not a valid role: Role '/test/role/slashes' cannot start with a slash'
> I0809 20:04:22.479658    73 sched.cpp:2055] Asked to abort the driver
> I0809 20:04:22.479843    73 sched.cpp:1233] Aborting framework 
> {noformat}
> I'd have expected one or both of the following:
> - SchedulerDriver.run() should have exited with a failed Proto.Status of some 
> form
> - Scheduler.error() should have been invoked when the "Got error" occurred
> Steps to reproduce:
> - Launch a scheduler instance, have it register with a known-bad framework 
> info. In this case a role containing slashes was used
> - Observe that the scheduler continues in a TASK_RUNNING state despite the 
> failed registration. From all appearances it looks like the Scheduler 
> implementation isn't invoked at all
> I'd guess that because this failure happens before framework registration, 
> there's some error handling that isn't fully initialized at this point.





[jira] [Commented] (MESOS-7872) Scheduler hang when registration fails (due to bad role)

2017-08-10 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122058#comment-16122058
 ] 

Alexander Rukletsov commented on MESOS-7872:


The problem is likely in the HTTP adapter. [Java side of the 
adapter|https://github.com/mesosphere/mesos-http-adapter/blob/master/src/main/java/com/mesosphere/mesos/http-adapter/MesosToSchedulerDriverAdapter.java]
 sends a {{SUBSCRIBE}} request that never completes, due to an error. That 
error is transferred to the [C++ side of the 
adapter|https://github.com/apache/mesos/blob/364abfc1bed8543b984ebd3712047b5ed8a109d2/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp#L550],
 but is not transmitted to the Java side, because {{SUBSCRIBED}} [has not 
succeeded|https://github.com/apache/mesos/blob/364abfc1bed8543b984ebd3712047b5ed8a109d2/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp#L699]
 yet! Deadlock.

A fix here would be allowing {{ERROR}} events to go through even if the 
scheduler has not subscribed yet.
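
The proposed fix can be pictured with a small sketch. It assumes the adapter drops every event that arrives before {{SUBSCRIBED}}; whether it drops or queues them is not stated above. The class EventGate and its method names are hypothetical and do not correspond to the JNI code linked in this comment.

```python
class EventGate:
    """Hypothetical gate between the native side and the Java-side callbacks."""

    def __init__(self, handler):
        self.handler = handler
        self.subscribed = False

    def on_event(self, event_type, payload=None):
        """Forward the event if allowed; return True when it was forwarded."""
        if event_type == "SUBSCRIBED":
            self.subscribed = True
        # Let ERROR through even before SUBSCRIBED, so that a failed
        # registration reaches the scheduler instead of being dropped.
        if self.subscribed or event_type == "ERROR":
            self.handler(event_type, payload)
            return True
        return False


seen = []
gate = EventGate(lambda kind, payload: seen.append((kind, payload)))
gate.on_event("OFFERS", [])  # dropped: not subscribed yet
gate.on_event("ERROR", "role cannot start with a slash")  # forwarded
print(seen)
```

With this exception in place the scheduler's error callback can fire for a rejected registration, which is one of the two outcomes the reporter expected.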

> Scheduler hang when registration fails (due to bad role)
> 
>
> Key: MESOS-7872
> URL: https://issues.apache.org/jira/browse/MESOS-7872
> Project: Mesos
>  Issue Type: Bug
>  Components: scheduler driver
>Affects Versions: 1.4.0
>Reporter: Till Toenshoff
>  Labels: framework, reliability, scheduler
>
> I'm finding that if framework registration fails, the mesos driver client 
> will hang indefinitely with the following output:
> {noformat}
> I0809 20:04:22.479391    73 sched.cpp:1187] Got error ''FrameworkInfo.role' 
> is not a valid role: Role '/test/role/slashes' cannot start with a slash'
> I0809 20:04:22.479658    73 sched.cpp:2055] Asked to abort the driver
> I0809 20:04:22.479843    73 sched.cpp:1233] Aborting framework 
> {noformat}
> I'd have expected one or both of the following:
> - SchedulerDriver.run() should have exited with a failed Proto.Status of some 
> form
> - Scheduler.error() should have been invoked when the "Got error" occurred
> Steps to reproduce:
> - Launch a scheduler instance, have it register with a known-bad framework 
> info. In this case a role containing slashes was used
> - Observe that the scheduler continues in a TASK_RUNNING state despite the 
> failed registration. From all appearances it looks like the Scheduler 
> implementation isn't invoked at all
> I'd guess that because this failure happens before framework registration, 
> there's some error handling that isn't fully initialized at this point.





[jira] [Updated] (MESOS-6950) Launching two tasks with the same Docker image simultaneously may cause a staging dir never cleaned up

2017-08-10 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-6950:

Labels: containerizer mesosphere  (was: )

> Launching two tasks with the same Docker image simultaneously may cause a 
> staging dir never cleaned up
> --
>
> Key: MESOS-6950
> URL: https://issues.apache.org/jira/browse/MESOS-6950
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>  Labels: containerizer, mesosphere
>
> If a user launches two tasks with the same Docker image simultaneously (e.g., 
> runs {{mesos-executor}} twice with the same Docker image), a staging 
> directory for the second task is never cleaned up, like this:
> {code}
> └── store
>     └── docker
>         ├── layers
>         │   ...
>         ├── staging
>         │   └── a6rXWC
>         └── storedImages
> {code}





[jira] [Updated] (MESOS-7874) Provide a consistent non-blocking preLaunch hook

2017-08-10 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-7874:
-
Description: 
Our use case: we need a non-blocking way to notify our secret management system 
during the task launch sequence on the agent. This mechanism needs to work for both 
{{DockerContainerizer}} and {{MesosContainerizer}}, and both {{custom 
executor}} and {{command executor}}, with proper access to labels on 
{{TaskInfo}}.

As of 1.3.0, the hooks in [hook.hpp | 
https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] are pretty 
inconsistent across these combinations.

The closest option is {{slavePreLaunchDockerTaskExecutorDecorator}}; however, 
it has a couple of problems:

1. For DockerContainerizer + custom executor, it strips away TaskInfo and sends 
a `None()` instead;
2. This hook is not called on {{MesosContainerizer}} at all. I guess it's 
because people can implement an {{isolator}}? However, it creates extra work 
for module authors and operators.

The other option is {{slaveRunTaskLabelDecorator}}, but it has its own problems:
1. Errors are silently swallowed, so a module cannot stop the task running sequence;
2. It's a blocking version, which means we cannot wait for the result of 
another subprocess or RPC.

I'm inclined to fix the two problems on 
{{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions.

  was:
Our use case: we need a non-blocking prelaunch hook to integrate with our own 
secret management system, and this hook needs to work under both 
{{DockerContainerizer}} and {{MesosContainerizer}}, for both {{custom 
executor}} and {{command executor}}, with proper access to {{TaskInfo}} 
(actually certain labels on it).

As of 1.3.0, the hooks in [hook.hpp | 
https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] are pretty 
inconsistent across these combinations.

The closest option is {{slavePreLaunchDockerTaskExecutorDecorator}}; however, 
it has a couple of problems:

1. For DockerContainerizer + custom executor, it strips away TaskInfo and sends 
a `None()` instead;
2. This hook is not called on {{MesosContainerizer}} at all. I guess it's 
because people can implement an {{isolator}}? However, it creates extra work 
for module authors and operators.

The other option is {{slaveRunTaskLabelDecorator}}, but it has its own problems:
1. Errors are silently swallowed, so a module cannot stop the task running sequence;
2. It's a blocking version, which means we cannot wait for the result of 
another subprocess or RPC.

I'm inclined to fix the two problems on 
{{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions.


> Provide a consistent non-blocking preLaunch hook
> 
>
> Key: MESOS-7874
> URL: https://issues.apache.org/jira/browse/MESOS-7874
> Project: Mesos
>  Issue Type: Improvement
>  Components: modules
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>  Labels: hooks, module
>
> Our use case: we need a non-blocking way to notify our secret management 
> system during the task launch sequence on the agent. This mechanism needs to work 
> for both {{DockerContainerizer}} and {{MesosContainerizer}}, and both 
> {{custom executor}} and {{command executor}}, with proper access to labels on 
> {{TaskInfo}}.
> As of 1.3.0, the hooks in [hook.hpp | 
> https://github.com/apache/mesos/blob/1.3.0/include/mesos/hook.hpp] are pretty 
> inconsistent across these combinations.
> The closest option is {{slavePreLaunchDockerTaskExecutorDecorator}}; 
> however, it has a couple of problems:
> 1. For DockerContainerizer + custom executor, it strips away TaskInfo and 
> sends a `None()` instead;
> 2. This hook is not called on {{MesosContainerizer}} at all. I guess it's 
> because people can implement an {{isolator}}? However, it creates extra work 
> for module authors and operators.
> The other option is {{slaveRunTaskLabelDecorator}}, but it has its own problems:
> 1. Errors are silently swallowed, so a module cannot stop the task running 
> sequence;
> 2. It's a blocking version, which means we cannot wait for the result of 
> another subprocess or RPC.
> I'm inclined to fix the two problems on 
> {{slavePreLaunchDockerTaskExecutorDecorator}}, but open to other suggestions.
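
To make the two requirements in the description concrete (the hook must not block, and a hook error must be able to stop the launch), here is a minimal callback-based sketch. All names in it (`Labels`, `Done`, `PreLaunchHook`, `launchTask`) are hypothetical and are not part of the Mesos hook API; the state strings are illustrative only.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Hypothetical types for illustration; these are not the Mesos hook API.
using Labels = std::map<std::string, std::string>;
// Completion callback: an empty error string means success.
using Done = std::function<void(const std::string& error)>;
using PreLaunchHook = std::function<void(const Labels&, Done)>;

// Starts the launch sequence and hands control to `hook`. The hook may call
// `done` immediately or later (e.g. when an RPC completes), so this function
// never waits for it. The outcome is written to `state` when `done` fires.
void launchTask(
    const Labels& labels,
    const PreLaunchHook& hook,
    std::string* state)
{
  *state = "STAGING";
  hook(labels, [state](const std::string& error) {
    // A non-empty error stops the launch instead of being swallowed.
    *state = error.empty() ? std::string("RUNNING")
                           : std::string("FAILED: ") + error;
  });
}
```

A hook that stores `done` and invokes it later leaves the task in `STAGING` until then; a hook that reports an error moves it to `FAILED: ...` rather than letting the launch continue.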





[jira] [Commented] (MESOS-5482) mesos/marathon task stuck in staging after slave reboot

2017-08-10 Thread Mao Geng (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121944#comment-16121944
 ] 

Mao Geng commented on MESOS-5482:
-

Hit this issue on Mesos 1.2.0 and Marathon 1.4.3 too. 
The agent received no pings from the master for 75 seconds, then re-registered:
{quote}
I0810 13:18:43.142431 18394 slave.cpp:4378] No pings from master received 
within 75secs
I0810 13:18:43.142588 18393 slave.cpp:920] Re-detecting master
I0810 13:18:43.142614 18393 slave.cpp:966] Detecting new master
I0810 13:18:43.142674 18407 status_update_manager.cpp:177] Pausing sending 
status updates
I0810 13:18:43.142755 18420 status_update_manager.cpp:177] Pausing sending 
status updates
I0810 13:18:43.142813 18415 slave.cpp:931] New master detected at 
master@10.1.36.4:5050
I0810 13:18:43.142840 18415 slave.cpp:955] No credentials provided. Attempting 
to register without authentication
I0810 13:18:43.142853 18415 slave.cpp:966] Detecting new master
I0810 13:18:44.431833 18415 slave.cpp:1242] Re-registered with master 
master@10.1.36.4:5050
I0810 13:18:44.431874 18415 slave.cpp:1279] Forwarding total oversubscribed 
resources {}
I0810 13:18:44.431895 18398 status_update_manager.cpp:184] Resuming sending 
status updates
I0810 13:18:44.433912 18386 slave.cpp:2683] Shutting down framework 
f853458f-b07b-4b79-8192-24953f474369-
I0810 13:18:44.433939 18386 slave.cpp:5083] Shutting down executor 
'metrics_statsd.2e578bc8-7bac-11e7-9ea1-0242c1e4f2c5' of framework 
f853458f-b07b-4b79-8192-24953f474369- at executor(1)@10.1.98.251:33041
W0810 13:18:44.435637 18440 slave.cpp:2823] Ignoring updating pid for framework 
f853458f-b07b-4b79-8192-24953f474369- because it is terminating
I0810 13:18:46.878993 18408 slave.cpp:1625] Got assigned task 
'metrics_statsd.70dff634-7dce-11e7-bea2-0242f4eb80ac' for framework 
f853458f-b07b-4b79-8192-24953f474369-
I0810 13:18:46.879406 18408 slave.cpp:1785] Launching task 
'metrics_statsd.70dff634-7dce-11e7-bea2-0242f4eb80ac' for framework 
f853458f-b07b-4b79-8192-24953f474369-
W0810 13:18:46.879436 18408 slave.cpp:1853] Ignoring running task 
'metrics_statsd.70dff634-7dce-11e7-bea2-0242f4eb80ac' of framework 
f853458f-b07b-4b79-8192-24953f474369- because the framework is terminating
I0810 13:18:47.613224 18415 slave.cpp:3816] Handling status update TASK_KILLED 
(UUID: af78fc5c-8552-4aee-abae-cda3d0ec2909) for task 
metrics_statsd.2e578bc8-7bac-11e7-9ea1-0242c1e4f2c5 of framework 
f853458f-b07b-4b79-8192-24953f474369- from executor(1)@10.1.98.251:33041
W0810 13:18:47.613261 18415 slave.cpp:3885] Ignoring status update TASK_KILLED 
(UUID: af78fc5c-8552-4aee-abae-cda3d0ec2909) for task 
metrics_statsd.2e578bc8-7bac-11e7-9ea1-0242c1e4f2c5 of framework 
f853458f-b07b-4b79-8192-24953f474369- for terminating framework 
f853458f-b07b-4b79-8192-24953f474369-
I0810 13:18:48.618629 18409 slave.cpp:4388] Got exited event for 
executor(1)@10.1.98.251:33041
I0810 13:18:48.713826 18390 docker.cpp:2358] Executor for container 
1f351db2-1011-4244-83c2-1854c44d7b65 has exited
I0810 13:18:48.713850 18390 docker.cpp:2052] Destroying container 
1f351db2-1011-4244-83c2-1854c44d7b65
I0810 13:18:48.713892 18390 docker.cpp:2179] Running docker stop on container 
1f351db2-1011-4244-83c2-1854c44d7b65
I0810 13:18:48.714363 18411 slave.cpp:4769] Executor 
'metrics_statsd.2e578bc8-7bac-11e7-9ea1-0242c1e4f2c5' of framework 
f853458f-b07b-4b79-8192-24953f474369- exited with status 0
I0810 13:18:48.714390 18411 slave.cpp:4869] Cleaning up executor 
'metrics_statsd.2e578bc8-7bac-11e7-9ea1-0242c1e4f2c5' of framework 
f853458f-b07b-4b79-8192-24953f474369- at executor(1)@10.1.98.251:33041
I0810 13:18:48.714589 18411 slave.cpp:4957] Cleaning up framework 
f853458f-b07b-4b79-8192-24953f474369-
I0810 13:18:48.714607 18432 gc.cpp:55] Scheduling 
'/mnt/mesos/slaves/508bde0b-4661-4a29-b674-32163345096f-S229/frameworks/f853458f-b07b-4b79-8192-24953f474369-/executors/metrics_statsd.2e578bc8-7bac-11e7-9ea1-0242c1e4f2c5/runs/1f351db2-1011-4244-83c2-1854c44d7b65'
 for gc 6.9173026667days in the future
I0810 13:18:48.714669 18410 status_update_manager.cpp:285] Closing status 
update streams for framework f853458f-b07b-4b79-8192-24953f474369-
I0810 13:18:48.714679 18432 gc.cpp:55] Scheduling 
'/mnt/mesos/slaves/508bde0b-4661-4a29-b674-32163345096f-S229/frameworks/f853458f-b07b-4b79-8192-24953f474369-/executors/metrics_statsd.2e578bc8-7bac-11e7-9ea1-0242c1e4f2c5'
 for gc 6.9172979259days in the future
I0810 13:18:48.714709 18432 gc.cpp:55] Scheduling 
'/mnt/mesos/meta/slaves/508bde0b-4661-4a29-b674-32163345096f-S229/frameworks/f853458f-b07b-4b79-8192-24953f474369-/executors/metrics_statsd.2e578bc8-7bac-11e7-9ea1-0242c1e4f2c5/runs/1f351db2-1011-4244-83c2-1854c44d7b65'
 for gc 6.9172953778days in the future
I0810 13:18:48.714725 18432 gc.cpp:55] Scheduling 

[jira] [Commented] (MESOS-6390) Ensure Python support scripts are linted

2017-08-10 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121894#comment-16121894
 ] 

Joseph Wu commented on MESOS-6390:
--

{code}
commit d04ab2096169513561d20a414c67ed1aaed0ecd7
Author: Armand Grillet 
Date:   Thu Aug 10 09:38:43 2017 -0700

Linted support/test-upgrade.py.

This will allow us to use PyLint on the
entire support directory in the future.

Review: https://reviews.apache.org/r/60235/
{code}

> Ensure Python support scripts are linted
> 
>
> Key: MESOS-6390
> URL: https://issues.apache.org/jira/browse/MESOS-6390
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Bannier
>Assignee: Armand Grillet
>  Labels: newbie, python
>
> Currently {{support/mesos-style.py}} does not lint files under {{support/}}. 
> This is mostly due to the fact that these scripts are so inconsistent 
> style-wise that they wouldn't even pass the linter now.
> We should clean up all Python scripts under {{support/}} so they pass the 
> Python linter, and activate that directory in the linter for future 
> additions. 





[jira] [Created] (MESOS-7878) Add default value for http_framework_authenticators flag

2017-08-10 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-7878:


 Summary: Add default value for http_framework_authenticators flag
 Key: MESOS-7878
 URL: https://issues.apache.org/jira/browse/MESOS-7878
 Project: Mesos
  Issue Type: Improvement
Reporter: Zhitao Li
Priority: Minor


Based on http://mesos.apache.org/documentation/latest/configuration/, 
{{http_authenticator}} has a default value {{basic}} but 
{{http_framework_authenticators}} does not have one.

Given that people running the default Mesos distribution only have {{basic}} 
available, I feel that we should add a default value to this flag to avoid 
surprising operators when they turn on the HTTP framework.
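
A minimal sketch of the proposed fallback, assuming an unset flag is represented by an empty string. `AuthFlags` and `frameworkAuthenticators` are hypothetical names used only for illustration and do not reflect the Mesos flags implementation.

```cpp
#include <cassert>
#include <string>

// Hypothetical flag holder; the field name mirrors the flag discussed above,
// but this is not the Mesos flags API.
struct AuthFlags
{
  std::string http_framework_authenticators;  // Empty means "not set".
};

// Returns the authenticator to use for HTTP frameworks, falling back to the
// proposed default "basic" when the operator did not set the flag.
std::string frameworkAuthenticators(const AuthFlags& flags)
{
  if (flags.http_framework_authenticators.empty()) {
    return "basic";
  }
  return flags.http_framework_authenticators;
}
```

With this shape, an operator who never touches the flag gets `basic`, and an explicit value still takes precedence.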

Proposing Greg to shepherd.





[jira] [Updated] (MESOS-7877) Audit test code for undefined behavior in accessing container elements

2017-08-10 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7877:
---
Labels: mesosphere newbie tech-debt test  (was: mesosphere newbie tech-debt)

> Audit test code for undefined behavior in accessing container elements
> --
>
> Key: MESOS-7877
> URL: https://issues.apache.org/jira/browse/MESOS-7877
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Benjamin Bannier
>  Labels: mesosphere, newbie, tech-debt, test
>
> We do not always make sure we never access elements from empty containers, 
> e.g., we use patterns like the following
> {code}
> Future<std::vector<Offer>> offers;
> // Satisfy offers.
> EXPECT_FALSE(offers->empty());
> const auto& offer = (*offers)[0];
> {code}
> While the intention here is to diagnose an empty {{offers}}, the code still 
> exhibits undefined behavior in the element access if {{offers}} was indeed 
> empty (compilers might aggressively exploit undefined behavior to e.g., 
> remove "impossible" code). Instead one should prevent accessing any elements 
> of an empty container, e.g.,
> {code}
> ASSERT_FALSE(offers->empty()); // Prevent execution of rest of test body.
> {code}
> We should audit and fix existing test code for such incorrect checks and 
> variations involving e.g., {{EXPECT_NE}}.





[jira] [Created] (MESOS-7877) Audit test code for undefined behavior in accessing container elements

2017-08-10 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-7877:
---

 Summary: Audit test code for undefined behavior in accessing 
container elements
 Key: MESOS-7877
 URL: https://issues.apache.org/jira/browse/MESOS-7877
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Benjamin Bannier


We do not always make sure we never access elements from empty containers, 
e.g., we use patterns like the following
{code}
Future<std::vector<Offer>> offers;

// Satisfy offers.

EXPECT_FALSE(offers->empty());

const auto& offer = (*offers)[0];
{code}

While the intention here is to diagnose an empty {{offers}}, the code still 
exhibits undefined behavior in the element access if {{offers}} was indeed 
empty (compilers might aggressively exploit undefined behavior to e.g., remove 
"impossible" code). Instead one should prevent accessing any elements of an 
empty container, e.g.,
{code}
ASSERT_FALSE(offers->empty()); // Prevent execution of rest of test body.
{code}

We should audit and fix existing test code for such incorrect checks and 
variations involving e.g., {{EXPECT_NE}}.
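
The same rule can be shown outside a test framework. In GoogleTest, a failed `ASSERT_*` returns from the current function, while a failed `EXPECT_*` lets execution continue into the element access. The sketch below applies the "check before access" idea with a small helper; `firstOrNull` is a hypothetical name, not something taken from the Mesos code base.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns a pointer to the first element, or nullptr if the container is
// empty, so callers never index into an empty vector.
template <typename T>
const T* firstOrNull(const std::vector<T>& items)
{
  return items.empty() ? nullptr : &items[0];
}
```

A caller then has to handle the `nullptr` case explicitly before dereferencing, which plays the role of the `ASSERT_FALSE` guard above.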





[jira] [Updated] (MESOS-7872) Scheduler hang when registration fails (due to bad role)

2017-08-10 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7872:
---
 Labels: framework reliability scheduler  (was: framework scheduler)
Component/s: scheduler driver

> Scheduler hang when registration fails (due to bad role)
> 
>
> Key: MESOS-7872
> URL: https://issues.apache.org/jira/browse/MESOS-7872
> Project: Mesos
>  Issue Type: Bug
>  Components: scheduler driver
>Affects Versions: 1.4.0
>Reporter: Till Toenshoff
>  Labels: framework, reliability, scheduler
>
> I'm finding that if framework registration fails, the mesos driver client 
> will hang indefinitely with the following output:
> {noformat}
> I0809 20:04:22.479391 73 sched.cpp:1187] Got error ''FrameworkInfo.role' 
> is not a valid role: Role '/test/role/slashes' cannot start with a slash'
> I0809 20:04:22.479658 73 sched.cpp:2055] Asked to abort the driver
> I0809 20:04:22.479843 73 sched.cpp:1233] Aborting framework 
> {noformat}
> I'd have expected one or both of the following:
> - SchedulerDriver.run() should have exited with a failed Proto.Status of some 
> form
> - Scheduler.error() should have been invoked when the "Got error" occurred
> Steps to reproduce:
> - Launch a scheduler instance, have it register with a known-bad framework 
> info. In this case a role containing slashes was used
> - Observe that the scheduler continues in a TASK_RUNNING state despite the 
> failed registration. From all appearances it looks like the Scheduler 
> implementation isn't invoked at all
> I'd guess that because this failure happens before framework registration, 
> there's some error handling that isn't fully initialized at this point.
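
For reference, the one rule quoted in the log above (a role must not start with a slash) can be sketched as a standalone check. `validateRole` is a hypothetical helper that covers only this single rule; it is not the validation code Mesos uses, which may enforce further constraints.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not the Mesos implementation. Returns an error
// message if `role` starts with a slash, or an empty string if the role
// passes this single check.
std::string validateRole(const std::string& role)
{
  if (!role.empty() && role[0] == '/') {
    return "Role '" + role + "' cannot start with a slash";
  }
  return "";
}
```

Running such a check in the scheduler before creating the driver would surface a bad role without depending on how the driver reports registration errors.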





[jira] [Commented] (MESOS-7872) Scheduler hang when registration fails (due to bad role)

2017-08-10 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121387#comment-16121387
 ] 

Alexander Rukletsov commented on MESOS-7872:


I've tried to reproduce this issue using a slightly modified 
{{no-executor-framework}}. Here is the output I get:
{noformat}
alex@alexr: ~/Projects/mesos/build/default $ ./src/no-executor-framework 
--master=127.0.0.1:5050
I0810 11:55:46.766144 1993596928 sched.cpp:232] Version: 1.4.0
I0810 11:55:46.766348 1993596928 sched.cpp:2090] Awaiting latch
I0810 11:55:46.771299 3211264 sched.cpp:336] New master detected at 
master@127.0.0.1:5050
I0810 11:55:46.774588 3211264 sched.cpp:352] No credentials provided. 
Attempting to register without authentication
I0810 11:55:46.792697 2674688 sched.cpp:1187] Got error ''FrameworkInfo.role' 
is not a valid role: Role '/test/rt' cannot start with a slash'
I0810 11:55:46.792721 2674688 sched.cpp:2055] Asked to abort the driver
E0810 11:55:46.792738 2674688 no_executor_framework.cpp:216] 
'FrameworkInfo.role' is not a valid role: Role '/test/rt' cannot start with a 
slash
I0810 11:55:46.792752 2674688 sched.cpp:1233] Aborting framework 
E0810 11:55:46.792788 4820992 process.cpp:2584] Failed to shutdown socket with 
fd 9, address 192.168.1.113:56500: Socket is not connected
I0810 11:55:46.792866 1993596928 sched.cpp:2092] Latch is triggered
I0810 11:55:46.792881 1993596928 sched.cpp:2021] Asked to stop the driver
{noformat}
If I remove 
[{{driver->stop}}|https://github.com/apache/mesos/blob/2cea83653afcf6d7470242379809645bfe009016/src/examples/no_executor_framework.cpp#L398],
 the scheduler exits anyway:
{noformat}
alex@alexr: ~/Projects/mesos/build/default $ ./src/no-executor-framework 
--master=127.0.0.1:5050
I0810 12:00:46.115882 1993596928 sched.cpp:232] Version: 1.4.0
I0810 12:00:46.116058 1993596928 sched.cpp:2090] Awaiting latch
I0810 12:00:46.118584 2674688 sched.cpp:336] New master detected at 
master@127.0.0.1:5050
I0810 12:00:46.118834 2674688 sched.cpp:352] No credentials provided. 
Attempting to register without authentication
I0810 12:00:46.120816 4284416 sched.cpp:1187] Got error ''FrameworkInfo.role' 
is not a valid role: Role '/test/role' cannot start with a slash'
I0810 12:00:46.120842 4284416 sched.cpp:2055] Asked to abort the driver
E0810 12:00:46.120847 4820992 process.cpp:2584] Failed to shutdown socket with 
fd 9, address 192.168.1.113:57081: Socket is not connected
E0810 12:00:46.120869 4284416 no_executor_framework.cpp:216] 
'FrameworkInfo.role' is not a valid role: Role '/test/role' cannot start with a 
slash
I0810 12:00:46.120895 4284416 sched.cpp:1233] Aborting framework 
I0810 12:00:46.121004 1993596928 sched.cpp:2092] Latch is triggered
{noformat}
Can you share the code of your scheduler, especially the part where you create 
and wait for the driver?
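
The "Awaiting latch" and "Latch is triggered" lines in the logs above suggest that waiting for the driver blocks on a latch which an abort releases. The sketch below models that behavior with a toy class so the expected outcome is easy to see. `ToyDriver` and `Status` are hypothetical stand-ins and not the Mesos `SchedulerDriver` API.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-ins for illustration; not the Mesos driver API.
enum class Status { RUNNING, ABORTED, STOPPED };

class ToyDriver
{
public:
  // Marks the driver as aborted and wakes any thread blocked in run().
  void abort() { finish(Status::ABORTED); }

  // Marks the driver as stopped and wakes any thread blocked in run().
  void stop() { finish(Status::STOPPED); }

  // Blocks until abort() or stop() has been called, then returns the
  // final status so the caller can tell a failure from a clean stop.
  Status run()
  {
    std::unique_lock<std::mutex> lock(mutex_);
    triggered_.wait(lock, [this] { return status_ != Status::RUNNING; });
    return status_;
  }

private:
  // Records the first terminal status and releases the latch.
  void finish(Status status)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    if (status_ == Status::RUNNING) {
      status_ = status;
    }
    triggered_.notify_all();
  }

  std::mutex mutex_;
  std::condition_variable triggered_;
  Status status_ = Status::RUNNING;
};
```

If a scheduler's wait loop never observes the terminal status, for example because it waits on something other than the driver, it would hang in the way the report describes.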

> Scheduler hang when registration fails (due to bad role)
> 
>
> Key: MESOS-7872
> URL: https://issues.apache.org/jira/browse/MESOS-7872
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.4.0
>Reporter: Till Toenshoff
>  Labels: framework, scheduler
>
> I'm finding that if framework registration fails, the mesos driver client 
> will hang indefinitely with the following output:
> {noformat}
> I0809 20:04:22.479391 73 sched.cpp:1187] Got error ''FrameworkInfo.role' 
> is not a valid role: Role '/test/role/slashes' cannot start with a slash'
> I0809 20:04:22.479658 73 sched.cpp:2055] Asked to abort the driver
> I0809 20:04:22.479843 73 sched.cpp:1233] Aborting framework 
> {noformat}
> I'd have expected one or both of the following:
> - SchedulerDriver.run() should have exited with a failed Proto.Status of some 
> form
> - Scheduler.error() should have been invoked when the "Got error" occurred
> Steps to reproduce:
> - Launch a scheduler instance, have it register with a known-bad framework 
> info. In this case a role containing slashes was used
> - Observe that the scheduler continues in a TASK_RUNNING state despite the 
> failed registration. From all appearances it looks like the Scheduler 
> implementation isn't invoked at all
> I'd guess that because this failure happens before framework registration, 
> there's some error handling that isn't fully initialized at this point.





[jira] [Commented] (MESOS-6950) Launching two tasks with the same Docker image simultaneously may cause a staging dir never cleaned up

2017-08-10 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121352#comment-16121352
 ] 

Qian Zhang commented on MESOS-6950:
---

RR: https://reviews.apache.org/r/61546/

> Launching two tasks with the same Docker image simultaneously may cause a 
> staging dir never cleaned up
> --
>
> Key: MESOS-6950
> URL: https://issues.apache.org/jira/browse/MESOS-6950
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>
> If a user launches two tasks with the same Docker image simultaneously (e.g., 
> runs {{mesos-executor}} twice with the same Docker image), a staging 
> directory for the second task is never cleaned up, like this:
> {code}
> └── store
>     └── docker
>         ├── layers
>         │   ...
>         ├── staging
>         │   └── a6rXWC
>         └── storedImages
> {code}





[jira] [Created] (MESOS-7876) Investigate jemalloc as a possible malloc for mesos

2017-08-10 Thread Benno Evers (JIRA)
Benno Evers created MESOS-7876:
--

 Summary: Investigate jemalloc as a possible malloc for mesos
 Key: MESOS-7876
 URL: https://issues.apache.org/jira/browse/MESOS-7876
 Project: Mesos
  Issue Type: Improvement
Reporter: Benno Evers
Assignee: Benno Evers


It is currently very hard to debug memory issues, in particular memory leaks, 
in mesos.

An alluring way to improve the situation would be to change the default malloc 
to jemalloc, which has built-in heap-tracking capabilities.

However, some care needs to be taken when considering changing such a 
fundamental part of mesos:

  * Would such a switch have any adverse impact on performance?
  * Is it available and will it compile on all our target platforms?
  * Is the jemalloc license compatible with bundling it as a third-party library?







[jira] [Updated] (MESOS-6950) Launching two tasks with the same Docker image simultaneously may cause a staging dir never cleaned up

2017-08-10 Thread Qian Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang updated MESOS-6950:
--
Shepherd: Gilbert Song  (was: Qian Zhang)

> Launching two tasks with the same Docker image simultaneously may cause a 
> staging dir never cleaned up
> --
>
> Key: MESOS-6950
> URL: https://issues.apache.org/jira/browse/MESOS-6950
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>
> If a user launches two tasks with the same Docker image simultaneously (e.g., 
> runs {{mesos-executor}} twice with the same Docker image), a staging 
> directory for the second task is never cleaned up, like this:
> {code}
> └── store
>     └── docker
>         ├── layers
>         │   ...
>         ├── staging
>         │   └── a6rXWC
>         └── storedImages
> {code}





[jira] [Assigned] (MESOS-6950) Launching two tasks with the same Docker image simultaneously may cause a staging dir never cleaned up

2017-08-10 Thread Qian Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang reassigned MESOS-6950:
-

Assignee: Qian Zhang  (was: Gilbert Song)

> Launching two tasks with the same Docker image simultaneously may cause a 
> staging dir never cleaned up
> --
>
> Key: MESOS-6950
> URL: https://issues.apache.org/jira/browse/MESOS-6950
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>
> If a user launches two tasks with the same Docker image simultaneously (e.g., 
> runs {{mesos-executor}} twice with the same Docker image), a staging 
> directory for the second task is never cleaned up, like this:
> {code}
> └── store
> └── docker
> ├── layers
> │...
> ├── staging
> │   └── a6rXWC
> └── storedImages
> {code}


