[jira] [Commented] (MESOS-8474) Test StorageLocalResourceProviderTest.ROOT_ConvertPreExistingVolume is flaky
[ https://issues.apache.org/jira/browse/MESOS-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338648#comment-16338648 ] Chun-Hung Hsiao commented on MESOS-8474: This is a similar issue, but now the race is between {{CREATE_VOLUME}} and {{CREATE_BLOCK}}. I will implement better synchronization logic in this test. > Test StorageLocalResourceProviderTest.ROOT_ConvertPreExistingVolume is flaky > > > Key: MESOS-8474 > URL: https://issues.apache.org/jira/browse/MESOS-8474 > Project: Mesos > Issue Type: Bug > Components: storage, test >Affects Versions: 1.5.0 >Reporter: Benjamin Bannier >Assignee: Chun-Hung Hsiao >Priority: Major > Labels: flaky, flaky-test, mesosphere > Attachments: consoleText.txt, consoleText.txt > > > Observed on our internal CI on ubuntu16.04 with SSL and GRPC enabled, > {noformat} > ../../src/tests/storage_local_resource_provider_tests.cpp:1898 > Expected: 2u > Which is: 2 > To be equal to: destroyed.size() > Which is: 1 > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8487) Introduce API changes for supporting quota limits.
[ https://issues.apache.org/jira/browse/MESOS-8487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338639#comment-16338639 ] Benjamin Mahler commented on MESOS-8487: https://reviews.apache.org/r/65334/ > Introduce API changes for supporting quota limits. > -- > > Key: MESOS-8487 > URL: https://issues.apache.org/jira/browse/MESOS-8487 > Project: Mesos > Issue Type: Task > Components: HTTP API >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > > Per MESOS-8068, the introduction of a quota limit requires introducing this > in the API. We should send out the proposed changes more broadly in the > interest of being more rigorous about API changes. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-8488) Docker bug can cause unkillable tasks
[ https://issues.apache.org/jira/browse/MESOS-8488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilbert Song updated MESOS-8488: Component/s: containerization > Docker bug can cause unkillable tasks > - > > Key: MESOS-8488 > URL: https://issues.apache.org/jira/browse/MESOS-8488 > Project: Mesos > Issue Type: Improvement > Components: containerization >Affects Versions: 1.5.0 >Reporter: Greg Mann >Priority: Major > Labels: mesosphere > > Due to an [issue on the Moby > project|https://github.com/moby/moby/issues/33820], it's possible for Docker > versions 1.13 and later to fail to catch a container exit, so that the > {{docker run}} command which was used to launch the container will never > return. This can lead to the Docker executor becoming stuck in a state where > it believes the container is still running and cannot be killed. > We should update the Docker executor to ensure that containers stuck in such > a state cannot cause unkillable Docker executors/tasks. > One way to do this would be a timeout, after which the Docker executor will > commit suicide if a kill task attempt has not succeeded. However, if we do > this we should also ensure that in the case that the container was actually > still running, either the Docker daemon or the DockerContainerizer would > clean up the container when it does exit. > Another option might be for the Docker executor to directly {{wait()}} on the > container's Linux PID, in order to notice when the container exits. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8488) Docker bug can cause unkillable tasks
Greg Mann created MESOS-8488: Summary: Docker bug can cause unkillable tasks Key: MESOS-8488 URL: https://issues.apache.org/jira/browse/MESOS-8488 Project: Mesos Issue Type: Improvement Affects Versions: 1.5.0 Reporter: Greg Mann Due to an [issue on the Moby project|https://github.com/moby/moby/issues/33820], it's possible for Docker versions 1.13 and later to fail to catch a container exit, so that the {{docker run}} command which was used to launch the container will never return. This can lead to the Docker executor becoming stuck in a state where it believes the container is still running and cannot be killed. We should update the Docker executor to ensure that containers stuck in such a state cannot cause unkillable Docker executors/tasks. One way to do this would be a timeout, after which the Docker executor will commit suicide if a kill task attempt has not succeeded. However, if we do this we should also ensure that in the case that the container was actually still running, either the Docker daemon or the DockerContainerizer would clean up the container when it does exit. Another option might be for the Docker executor to directly {{wait()}} on the container's Linux PID, in order to notice when the container exits. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
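The timeout idea described above can be sketched as follows. This is a hypothetical illustration, not Mesos executor code: the helper name `killWithDeadline` and its shape are assumptions. It sends SIGTERM and then polls for the process's exit until a deadline; returning `false` models the case where the executor would have to escalate (e.g. commit suicide), and it assumes the PID is a direct child so `waitpid()` can reap it.

```cpp
#include <chrono>
#include <csignal>
#include <thread>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical sketch (not a Mesos API): send SIGTERM to `pid`, then poll
// until `deadline` for the process to exit. Returns true if the process was
// reaped in time; false means the kill did not take effect and the caller
// would need to escalate. Assumes `pid` is a direct child of this process.
bool killWithDeadline(pid_t pid, std::chrono::seconds deadline)
{
  ::kill(pid, SIGTERM);

  const auto end = std::chrono::steady_clock::now() + deadline;
  while (std::chrono::steady_clock::now() < end) {
    int status = 0;
    if (::waitpid(pid, &status, WNOHANG) == pid) {
      return true; // The process exited and was reaped.
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
  }
  return false; // Still running after the deadline: the task is stuck.
}
```

For a real container PID that is not a child of the executor, the existence probe would instead use `kill(pid, 0)` and `ESRCH`, since `waitpid()` only works on children.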
[jira] [Commented] (MESOS-6822) CNI reports confusing error message for failed interface setup.
[ https://issues.apache.org/jira/browse/MESOS-6822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338509#comment-16338509 ] Qian Zhang commented on MESOS-6822: --- commit 2cdbec02e37c794627204f0e1fadf09e5325507d Author: Qian Zhang Date: Tue Jan 23 15:54:58 2018 +0800 Updated the way to output error messages in `NetworkCniIsolatorSetup`. Review: https://reviews.apache.org/r/65306 > CNI reports confusing error message for failed interface setup. > --- > > Key: MESOS-6822 > URL: https://issues.apache.org/jira/browse/MESOS-6822 > Project: Mesos > Issue Type: Bug > Components: network >Affects Versions: 1.1.0 >Reporter: Alexander Rukletsov >Assignee: Qian Zhang >Priority: Major > Fix For: 1.6.0 > > > Saw this today: > {noformat} > Failed to bring up the loopback interface in the new network namespace of pid > 17067: Success > {noformat} > which is produced by this code: > https://github.com/apache/mesos/blob/1e72605e9892eb4e518442ab9c1fe2a1a1696748/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L1854-L1859 > Note that ssh'ing into the machine confirmed that {{ifconfig}} is available > in {{PATH}}. > Full log: http://pastebin.com/hVdNz6yk -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8487) Introduce API changes for supporting quota limits.
Benjamin Mahler created MESOS-8487: -- Summary: Introduce API changes for supporting quota limits. Key: MESOS-8487 URL: https://issues.apache.org/jira/browse/MESOS-8487 Project: Mesos Issue Type: Task Components: HTTP API Reporter: Benjamin Mahler Assignee: Benjamin Mahler Per MESOS-8068, the introduction of a quota limit requires introducing this in the API. We should send out the proposed changes more broadly in the interest of being more rigorous about API changes. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-8068) Non-revocable bursting over quota guarantees via limits.
[ https://issues.apache.org/jira/browse/MESOS-8068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-8068: --- Target Version/s: 1.6.0 > Non-revocable bursting over quota guarantees via limits. > > > Key: MESOS-8068 > URL: https://issues.apache.org/jira/browse/MESOS-8068 > Project: Mesos > Issue Type: Epic > Components: allocation >Reporter: Benjamin Mahler >Priority: Major > Labels: multitenancy > > Prior to introducing a revocable tier of allocation (see MESOS-4441), there > is a notion of whether a role can burst over its quota guarantee. > We currently apply implicit limits in the following way: > No quota guarantee set: (guarantee 0, no limit) > Quota guarantee set: (guarantee G, limit G) > That is, we only support burst-only without guarantee and > guarantee-only without burst. We do not support bursting over some non-zero > guarantee: (guarantee G, limit L >= G). > The idea here is that we should make these implicit limits explicit to > clarify for users the distinction between guarantees and limits, and to > support bursting over the guarantee. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8486) Webui should display role limits.
Benjamin Mahler created MESOS-8486: -- Summary: Webui should display role limits. Key: MESOS-8486 URL: https://issues.apache.org/jira/browse/MESOS-8486 Project: Mesos Issue Type: Task Components: webui Reporter: Benjamin Mahler With the addition of quota limits (see MESOS-8068), the UI should be updated to display the per role limit information. Specifically, the 'Roles' tab needs to be updated. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-7292) Introduce a "sensitive mode" in Mesos which prevents leaks of sensitive data.
[ https://issues.apache.org/jira/browse/MESOS-7292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338447#comment-16338447 ] Till Toenshoff commented on MESOS-7292: --- I have linked another environment handling improvement story as they could possibly be solved in one go. > Introduce a "sensitive mode" in Mesos which prevents leaks of sensitive data. > - > > Key: MESOS-7292 > URL: https://issues.apache.org/jira/browse/MESOS-7292 > Project: Mesos > Issue Type: Improvement > Components: security >Reporter: Alexander Rukletsov >Priority: Major > Labels: debugging, mesosphere, newbie++, security > > Consider a following scenario. A user passes some sensitive data in an > environment variable to a task. These data may be logged by Mesos components, > e.g., executor as part of {{mesos-containerizer}} invocation. While this is > useful for debugging, this might be an issue in some production environments. > One of the solution is to have global "sensitive mode", that turns off > logging of such sensitive data. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8484) stout test NumifyTest.HexNumberTest fails.
[ https://issues.apache.org/jira/browse/MESOS-8484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338220#comment-16338220 ] Benno Evers commented on MESOS-8484: In boost 1.53, lexical_cast implements its own parser that doesn't handle the '0x' prefix, therefore parsing the two strings in the test would return an error. In boost 1.65, lexical_cast calls std::istream::operator>>, which on mac (i.e. using libc++) can successfully parse strings of the form "0x10.9" or "0x1p-5", and returns the correct number. On linux platforms (i.e. using libstdc++), std::istream::operator>> is not able to parse these strings and thus returns an error. The function stout::numify wants to achieve platform independence by forbidding these kinds of literals on all platforms. However, the checks are only happening *after* boost was already given the chance to parse the string, which has platform-dependent behaviour. > stout test NumifyTest.HexNumberTest fails. > --- > > Key: MESOS-8484 > URL: https://issues.apache.org/jira/browse/MESOS-8484 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.6.0 > Environment: macOS 10.13.2 (17C88) > Apple LLVM version 9.0.0 (clang-900.0.37) > ../configure && make check -j6 >Reporter: Till Toenshoff >Assignee: Benjamin Bannier >Priority: Blocker > > The current Mesos master shows the following on my machine: > {noformat} > [ RUN ] NumifyTest.HexNumberTest > ../../../3rdparty/stout/tests/numify_tests.cpp:57: Failure > Value of: numify("0x10.9").isError() > Actual: false > Expected: true > ../../../3rdparty/stout/tests/numify_tests.cpp:58: Failure > Value of: numify("0x1p-5").isError() > Actual: false > Expected: true > [ FAILED ] NumifyTest.HexNumberTest (0 ms) > {noformat} > This problem disappears for me when reverting the latest boost upgrade. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
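The platform-independent check the comment describes can be sketched like this. `numifyDouble` is an illustrative stand-in for stout's `numify<double>`, not the actual implementation: it rejects C99 hex-float literals such as "0x10.9" or "0x1p-5" *before* the string reaches a parser whose hex handling differs between libc++ and libstdc++ (plain hex integers are handled by a separate path in the real numify).

```cpp
#include <cstdlib>
#include <optional>
#include <string>

// Illustrative stand-in for stout's numify<double> (assumed name):
// forbid "0x"/"0X" prefixes up front so the result does not depend on
// whether the underlying parser accepts C99 hexadecimal floats.
std::optional<double> numifyDouble(const std::string& s)
{
  const size_t start = (!s.empty() && (s[0] == '+' || s[0] == '-')) ? 1 : 0;
  if (s.size() >= start + 2 && s[start] == '0' &&
      (s[start + 1] == 'x' || s[start + 1] == 'X')) {
    return std::nullopt; // Reject hex literals uniformly on all platforms.
  }

  char* end = nullptr;
  const double value = std::strtod(s.c_str(), &end);
  if (end == s.c_str() || *end != '\0') {
    return std::nullopt; // Empty input or trailing garbage: not a number.
  }
  return value;
}
```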
[jira] [Commented] (MESOS-8480) Mesos returns high resource usage when killing a Docker task.
[ https://issues.apache.org/jira/browse/MESOS-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338131#comment-16338131 ] Gilbert Song commented on MESOS-8480: - [~zhitao] , very likely, since we got -1s for https://issues.apache.org/jira/browse/MESOS-8481 > Mesos returns high resource usage when killing a Docker task. > - > > Key: MESOS-8480 > URL: https://issues.apache.org/jira/browse/MESOS-8480 > Project: Mesos > Issue Type: Bug > Components: cgroups >Reporter: Chun-Hung Hsiao >Assignee: Chun-Hung Hsiao >Priority: Major > Fix For: 1.3.2, 1.4.2, 1.6.0, 1.5.1 > > Attachments: test.cpp > > > The way we get resource statistics for Docker tasks is through getting the > cgroup subsystem path through {{/proc//cgroup}} first (taking the > {{cpuacct}} subsystem as an example): > {noformat} > 9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b > {noformat} > Then read > {{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}} > to get the statistics: > {noformat} > user 4 > system 0 > {noformat} > However, when a Docker container is being torn down, it seems that Docker > or the operating system will first move the process to the root cgroup before > actually killing it, making {{/proc//docker}} look like the following: > {noformat} > 9:cpuacct,cpu:/ > {noformat} > This makes a racy call to > [{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935] > return a single '/', which in turn makes > [{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991] > read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the > statistics for the root cgroup: > {noformat} > user 228058750 > system 24506461 > {noformat} > This can be reproduced by [^test.cpp] with the following command: > {noformat} > $ docker run --name sleep -d --rm 
alpine sleep 1000; ./test $(docker inspect > sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep > ... > Reading file '/proc/44224/cgroup' > Reading file > '/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat' > user 4 > system 0 > Reading file '/proc/44224/cgroup' > Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat' > user 228058750 > system 24506461 > Reading file '/proc/44224/cgroup' > Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat' > user 228058750 > system 24506461 > Failed to open file '/proc/44224/cgroup' > sleep > [2]- Exit 1 ./test $(docker inspect sleep | jq > .[].State.Pid) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
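The lookup at the heart of this race can be sketched with a small parser for one `/proc/<pid>/cgroup` line. This is a hypothetical helper, not the actual `cgroup::internal::cgroup()` code: a caller would treat a returned "/" as the teardown race described above and retry or fail, rather than read the root cgroup's statistics.

```cpp
#include <optional>
#include <sstream>
#include <string>

// Hypothetical parser for one /proc/<pid>/cgroup line, e.g.
//   "9:cpuacct,cpu:/docker/66fbe67b..."
// Line format: <hierarchy-id>:<subsystem,...>:<cgroup-path>
// Returns the cgroup path for `subsystem`, or nullopt if the line
// belongs to a different subsystem or is malformed.
std::optional<std::string> cgroupOf(const std::string& line,
                                    const std::string& subsystem)
{
  const size_t first = line.find(':');
  if (first == std::string::npos) return std::nullopt;
  const size_t second = line.find(':', first + 1);
  if (second == std::string::npos) return std::nullopt;

  // The middle field is a comma-separated subsystem list.
  std::stringstream subsystems(line.substr(first + 1, second - first - 1));
  std::string name;
  while (std::getline(subsystems, name, ',')) {
    if (name == subsystem) {
      return line.substr(second + 1);
    }
  }
  return std::nullopt;
}
```

With this split out, the caller can check `*path == "/"` explicitly instead of silently concatenating it into `/sys/fs/cgroup/cpuacct///cpuacct.stat`.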
[jira] [Assigned] (MESOS-7258) Provide scheduler calls to subscribe to additional roles and unsubscribe from roles.
[ https://issues.apache.org/jira/browse/MESOS-7258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kapil Arya reassigned MESOS-7258: - Assignee: Kapil Arya > Provide scheduler calls to subscribe to additional roles and unsubscribe from > roles. > > > Key: MESOS-7258 > URL: https://issues.apache.org/jira/browse/MESOS-7258 > Project: Mesos > Issue Type: Improvement > Components: master, scheduler api >Reporter: Benjamin Mahler >Assignee: Kapil Arya >Priority: Major > Labels: multitenancy > > The current support for schedulers to subscribe to additional roles or > unsubscribe from some of their roles requires that the scheduler obtain a new > subscription with the master which invalidates the event stream. > A more lightweight mechanism would be to provide calls for the scheduler to > subscribe to additional roles or unsubscribe from some roles such that the > existing event stream remains open and offers to the new roles arrive on the > existing event stream. E.g. > SUBSCRIBE_TO_ROLE > UNSUBSCRIBE_FROM_ROLE > One open question pertains to the terminology here, whether we would want to > avoid using "subscribe" in this context. An alternative would be: > UPDATE_FRAMEWORK_INFO > Which provides a generic mechanism for a framework to perform framework info > updates without obtaining a new event stream. > In addition, it would be easier to use if it returned 200 on success and an > error response if invalid, etc. Rather than returning 202. > *NOTE*: Not specific to this issue, but we need to figure out how to allow > the framework to not leak reservations, e.g. MESOS-7651. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-7258) Provide scheduler calls to subscribe to additional roles and unsubscribe from roles.
[ https://issues.apache.org/jira/browse/MESOS-7258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7258: --- Description: The current support for schedulers to subscribe to additional roles or unsubscribe from some of their roles requires that the scheduler obtain a new subscription with the master which invalidates the event stream. A more lightweight mechanism would be to provide calls for the scheduler to subscribe to additional roles or unsubscribe from some roles such that the existing event stream remains open and offers to the new roles arrive on the existing event stream. E.g. SUBSCRIBE_TO_ROLE UNSUBSCRIBE_FROM_ROLE One open question pertains to the terminology here, whether we would want to avoid using "subscribe" in this context. An alternative would be: UPDATE_FRAMEWORK_INFO Which provides a generic mechanism for a framework to perform framework info updates without obtaining a new event stream. In addition, it would be easier to use if it returned 200 on success and an error response if invalid, etc. Rather than returning 202. *NOTE*: Not specific to this issue, but we need to figure out how to allow the framework to not leak reservations, e.g. MESOS-7651. was: The current support for schedulers to subscribe to additional roles or unsubscribe from some of their roles requires that the scheduler obtain a new subscription with the master which invalidates the event stream. A more lightweight mechanism would be to provide calls for the scheduler to subscribe to additional roles or unsubscribe from some roles such that the existing event stream remains open and offers to the new roles arrive on the existing event stream. E.g. SUBSCRIBE_TO_ROLE UNSUBSCRIBE_FROM_ROLE One open question pertains to the terminology here, whether we would want to avoid using "subscribe" in this context. 
An alternative would be: UPDATE_FRAMEWORK_INFO Which provides a generic mechanism for a framework to perform framework info updates without obtaining a new event stream. *NOTE*: Not specific to this issue, but we need to figure out how to allow the framework to not leak reservations, e.g. MESOS-7651. > Provide scheduler calls to subscribe to additional roles and unsubscribe from > roles. > > > Key: MESOS-7258 > URL: https://issues.apache.org/jira/browse/MESOS-7258 > Project: Mesos > Issue Type: Improvement > Components: master, scheduler api >Reporter: Benjamin Mahler >Priority: Major > Labels: multitenancy > > The current support for schedulers to subscribe to additional roles or > unsubscribe from some of their roles requires that the scheduler obtain a new > subscription with the master which invalidates the event stream. > A more lightweight mechanism would be to provide calls for the scheduler to > subscribe to additional roles or unsubscribe from some roles such that the > existing event stream remains open and offers to the new roles arrive on the > existing event stream. E.g. > SUBSCRIBE_TO_ROLE > UNSUBSCRIBE_FROM_ROLE > One open question pertains to the terminology here, whether we would want to > avoid using "subscribe" in this context. An alternative would be: > UPDATE_FRAMEWORK_INFO > Which provides a generic mechanism for a framework to perform framework info > updates without obtaining a new event stream. > In addition, it would be easier to use if it returned 200 on success and an > error response if invalid, etc. Rather than returning 202. > *NOTE*: Not specific to this issue, but we need to figure out how to allow > the framework to not leak reservations, e.g. MESOS-7651. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-8484) stout test NumifyTest.HexNumberTest fails.
[ https://issues.apache.org/jira/browse/MESOS-8484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Toenshoff updated MESOS-8484: -- Shepherd: Till Toenshoff > stout test NumifyTest.HexNumberTest fails. > --- > > Key: MESOS-8484 > URL: https://issues.apache.org/jira/browse/MESOS-8484 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.6.0 > Environment: macOS 10.13.2 (17C88) > Apple LLVM version 9.0.0 (clang-900.0.37) > ../configure && make check -j6 >Reporter: Till Toenshoff >Assignee: Benjamin Bannier >Priority: Blocker > > The current Mesos master shows the following on my machine: > {noformat} > [ RUN ] NumifyTest.HexNumberTest > ../../../3rdparty/stout/tests/numify_tests.cpp:57: Failure > Value of: numify("0x10.9").isError() > Actual: false > Expected: true > ../../../3rdparty/stout/tests/numify_tests.cpp:58: Failure > Value of: numify("0x1p-5").isError() > Actual: false > Expected: true > [ FAILED ] NumifyTest.HexNumberTest (0 ms) > {noformat} > This problem disappears for me when reverting the latest boost upgrade. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-8484) stout test NumifyTest.HexNumberTest fails.
[ https://issues.apache.org/jira/browse/MESOS-8484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Toenshoff reassigned MESOS-8484: - Assignee: Benjamin Bannier > stout test NumifyTest.HexNumberTest fails. > --- > > Key: MESOS-8484 > URL: https://issues.apache.org/jira/browse/MESOS-8484 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.6.0 > Environment: macOS 10.13.2 (17C88) > Apple LLVM version 9.0.0 (clang-900.0.37) > ../configure && make check -j6 >Reporter: Till Toenshoff >Assignee: Benjamin Bannier >Priority: Blocker > > The current Mesos master shows the following on my machine: > {noformat} > [ RUN ] NumifyTest.HexNumberTest > ../../../3rdparty/stout/tests/numify_tests.cpp:57: Failure > Value of: numify("0x10.9").isError() > Actual: false > Expected: true > ../../../3rdparty/stout/tests/numify_tests.cpp:58: Failure > Value of: numify("0x1p-5").isError() > Actual: false > Expected: true > [ FAILED ] NumifyTest.HexNumberTest (0 ms) > {noformat} > This problem disappears for me when reverting the latest boost upgrade. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-8480) Mesos returns high resource usage when killing a Docker task.
[ https://issues.apache.org/jira/browse/MESOS-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338114#comment-16338114 ] Zhitao Li edited comment on MESOS-8480 at 1/24/18 7:39 PM: --- Will this be also cherrypicked to 1.5.0 since the RC is still not finalized yet? was (Author: zhitao): Will this be also back ported to 1.5.0 since the RC is still not finalized yet? > Mesos returns high resource usage when killing a Docker task. > - > > Key: MESOS-8480 > URL: https://issues.apache.org/jira/browse/MESOS-8480 > Project: Mesos > Issue Type: Bug > Components: cgroups >Reporter: Chun-Hung Hsiao >Assignee: Chun-Hung Hsiao >Priority: Major > Fix For: 1.3.2, 1.4.2, 1.6.0, 1.5.1 > > Attachments: test.cpp > > > The way we get resource statistics for Docker tasks is through getting the > cgroup subsystem path through {{/proc//cgroup}} first (taking the > {{cpuacct}} subsystem as an example): > {noformat} > 9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b > {noformat} > Then read > {{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}} > to get the statistics: > {noformat} > user 4 > system 0 > {noformat} > However, when a Docker container is being torn down, it seems that Docker > or the operating system will first move the process to the root cgroup before > actually killing it, making {{/proc//docker}} look like the following: > {noformat} > 9:cpuacct,cpu:/ > {noformat} > This makes a racy call to > [{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935] > return a single '/', which in turn makes > [{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991] > read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the > statistics for the root cgroup: > {noformat} > user 228058750 > system 24506461 > {noformat} > This 
can be reproduced by [^test.cpp] with the following command: > {noformat} > $ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect > sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep > ... > Reading file '/proc/44224/cgroup' > Reading file > '/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat' > user 4 > system 0 > Reading file '/proc/44224/cgroup' > Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat' > user 228058750 > system 24506461 > Reading file '/proc/44224/cgroup' > Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat' > user 228058750 > system 24506461 > Failed to open file '/proc/44224/cgroup' > sleep > [2]- Exit 1 ./test $(docker inspect sleep | jq > .[].State.Pid) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8480) Mesos returns high resource usage when killing a Docker task.
[ https://issues.apache.org/jira/browse/MESOS-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338114#comment-16338114 ] Zhitao Li commented on MESOS-8480: -- Will this be also back ported to 1.5.0 since the RC is still not finalized yet? > Mesos returns high resource usage when killing a Docker task. > - > > Key: MESOS-8480 > URL: https://issues.apache.org/jira/browse/MESOS-8480 > Project: Mesos > Issue Type: Bug > Components: cgroups >Reporter: Chun-Hung Hsiao >Assignee: Chun-Hung Hsiao >Priority: Major > Fix For: 1.3.2, 1.4.2, 1.6.0, 1.5.1 > > Attachments: test.cpp > > > The way we get resource statistics for Docker tasks is through getting the > cgroup subsystem path through {{/proc//cgroup}} first (taking the > {{cpuacct}} subsystem as an example): > {noformat} > 9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b > {noformat} > Then read > {{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}} > to get the statistics: > {noformat} > user 4 > system 0 > {noformat} > However, when a Docker container is being torn down, it seems that Docker > or the operating system will first move the process to the root cgroup before > actually killing it, making {{/proc//docker}} look like the following: > {noformat} > 9:cpuacct,cpu:/ > {noformat} > This makes a racy call to > [{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935] > return a single '/', which in turn makes > [{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991] > read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the > statistics for the root cgroup: > {noformat} > user 228058750 > system 24506461 > {noformat} > This can be reproduced by [^test.cpp] with the following command: > {noformat} > $ docker run --name sleep -d --rm alpine sleep 1000; 
./test $(docker inspect > sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep > ... > Reading file '/proc/44224/cgroup' > Reading file > '/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat' > user 4 > system 0 > Reading file '/proc/44224/cgroup' > Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat' > user 228058750 > system 24506461 > Reading file '/proc/44224/cgroup' > Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat' > user 228058750 > system 24506461 > Failed to open file '/proc/44224/cgroup' > sleep > [2]- Exit 1 ./test $(docker inspect sleep | jq > .[].State.Pid) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8469) Mesos master might drop some events in the operator API stream
[ https://issues.apache.org/jira/browse/MESOS-8469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338104#comment-16338104 ] Greg Mann commented on MESOS-8469: -- Related test reviews: https://reviews.apache.org/r/65315/ https://reviews.apache.org/r/65316/ > Mesos master might drop some events in the operator API stream > -- > > Key: MESOS-8469 > URL: https://issues.apache.org/jira/browse/MESOS-8469 > Project: Mesos > Issue Type: Bug >Reporter: Vinod Kone >Assignee: Greg Mann >Priority: Critical > Fix For: 1.5.0 > > > Inside `Master::updateTask`, we call `Subscribers::send` which asynchronously > calls `Subscribers::Subscriber::send` on each subscriber. > But the problem is that inside `Subscribers::Subscriber::send` we are looking > up the state of the master (e.g., getting Task* and Framework*) which might > have changed between `Subscribers::send` and `Subscribers::Subscriber::send`. > > For example, if a terminal task received an acknowledgement the task might be > removed from master's state, causing us to drop the TASK_UPDATED event. > > We noticed this in an internal cluster, where a TASK_KILLED update was sent > to one subscriber but not the other. > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
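The race can be modeled with a toy example. This is not Mesos code; `Master`, `sendLazy`, and `sendSnapshot` are hypothetical names. `sendLazy` defers the task lookup to delivery time, so an acknowledgement that removes the task in between drops the event; `sendSnapshot` copies the relevant state when the event is enqueued, which is the direction a fix would take.

```cpp
#include <functional>
#include <map>
#include <queue>
#include <string>

// Toy model of the MESOS-8469 race (hypothetical types and names).
struct Master {
  std::map<std::string, std::string> tasks;          // taskId -> state
  std::queue<std::function<std::string()>> pending;  // deferred sends

  // Racy: the lookup runs at delivery time, after the state may have
  // changed. A removed task makes the event effectively disappear.
  void sendLazy(const std::string& taskId) {
    pending.push([this, taskId]() {
      auto it = tasks.find(taskId);
      return it == tasks.end() ? std::string("<dropped>") : it->second;
    });
  }

  // Safe: snapshot the state at enqueue time and capture it by value.
  void sendSnapshot(const std::string& taskId) {
    auto it = tasks.find(taskId);
    const std::string state =
        it == tasks.end() ? std::string("<dropped>") : it->second;
    pending.push([state]() { return state; });
  }

  std::string deliverNext() {
    std::function<std::string()> f = pending.front();
    pending.pop();
    return f();
  }
};
```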
[jira] [Comment Edited] (MESOS-8475) Event-specific overloads for 'Master::Subscribers::Subscriber::send()'
[ https://issues.apache.org/jira/browse/MESOS-8475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338102#comment-16338102 ] Greg Mann edited comment on MESOS-8475 at 1/24/18 7:33 PM: --- NOTE that when this ticket is addressed, we will also need to update the related test {{MasterAPITest.EventAuthorizationDelayed}}, since it currently depends on each event causing 4 calls into the authorizer: https://reviews.apache.org/r/65316 was (Author: greggomann): NOTE that when this ticket is addressed, we will also need to update the related test {{MasterAPITest.EventAuthorizationDelayed}}, since it currently depends on each event causing 4 calls into the authorizer. > Event-specific overloads for 'Master::Subscribers::Subscriber::send()' > -- > > Key: MESOS-8475 > URL: https://issues.apache.org/jira/browse/MESOS-8475 > Project: Mesos > Issue Type: Improvement >Reporter: Greg Mann >Priority: Major > Labels: authorization, mesosphere > > The code could be more efficient and more readable if we introduce > event-specific overloads for the {{Master::Subscribers::Subscriber::send()}} > method. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8475) Event-specific overloads for 'Master::Subscribers::Subscriber::send()'
[ https://issues.apache.org/jira/browse/MESOS-8475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338102#comment-16338102 ] Greg Mann commented on MESOS-8475: -- NOTE that when this ticket is addressed, we will also need to update the related test {{MasterAPITest.EventAuthorizationDelayed}}, since it currently depends on each event causing 4 calls into the authorizer. > Event-specific overloads for 'Master::Subscribers::Subscriber::send()' > -- > > Key: MESOS-8475 > URL: https://issues.apache.org/jira/browse/MESOS-8475 > Project: Mesos > Issue Type: Improvement >Reporter: Greg Mann >Priority: Major > Labels: authorization, mesosphere > > The code could be more efficient and more readable if we introduce > event-specific overloads for the {{Master::Subscribers::Subscriber::send()}} > method. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-8485) MasterTest.RegistryGcByCount is flaky
[ https://issues.apache.org/jira/browse/MESOS-8485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers reassigned MESOS-8485: -- Assignee: Benno Evers > MasterTest.RegistryGcByCount is flaky > - > > Key: MESOS-8485 > URL: https://issues.apache.org/jira/browse/MESOS-8485 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.5.0 >Reporter: Vinod Kone >Assignee: Benno Evers >Priority: Major > Labels: flaky-test > > Observed this while testing Mesos 1.5.0-rc1 in ASF CI. > > {code} > 3: [ RUN ] MasterTest.RegistryGcByCount > ..snip... > 3: I0123 19:22:05.929347 15994 slave.cpp:1201] Detecting new master > 3: I0123 19:22:05.931701 15988 slave.cpp:1228] Authenticating with master > master@172.17.0.2:45634 > 3: I0123 19:22:05.931838 15988 slave.cpp:1237] Using default CRAM-MD5 > authenticatee > 3: I0123 19:22:05.932153 15999 authenticatee.cpp:121] Creating new client > SASL connection > 3: I0123 19:22:05.932580 15992 master.cpp:8958] Authenticating > slave(442)@172.17.0.2:45634 > 3: I0123 19:22:05.932822 15990 authenticator.cpp:414] Starting authentication > session for crammd5-authenticatee(870)@172.17.0.2:45634 > 3: I0123 19:22:05.933163 15989 authenticator.cpp:98] Creating new server SASL > connection > 3: I0123 19:22:05.933465 16001 authenticatee.cpp:213] Received SASL > authentication mechanisms: CRAM-MD5 > 3: I0123 19:22:05.933495 16001 authenticatee.cpp:239] Attempting to > authenticate with mechanism 'CRAM-MD5' > 3: I0123 19:22:05.933631 15987 authenticator.cpp:204] Received SASL > authentication start > 3: I0123 19:22:05.933712 15987 authenticator.cpp:326] Authentication requires > more steps > 3: I0123 19:22:05.933851 15987 authenticatee.cpp:259] Received SASL > authentication step > 3: I0123 19:22:05.934006 15987 authenticator.cpp:232] Received SASL > authentication step > 3: I0123 19:22:05.934041 15987 auxprop.cpp:109] Request to lookup properties > for user: 'test-principal' realm: '455912973e2c' server FQDN: 
'455912973e2c' > SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false > SASL_AUXPROP_AUTHZID: false > 3: I0123 19:22:05.934095 15987 auxprop.cpp:181] Looking up auxiliary property > '*userPassword' > 3: I0123 19:22:05.934147 15987 auxprop.cpp:181] Looking up auxiliary property > '*cmusaslsecretCRAM-MD5' > 3: I0123 19:22:05.934279 15987 auxprop.cpp:109] Request to lookup properties > for user: 'test-principal' realm: '455912973e2c' server FQDN: '455912973e2c' > SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false > SASL_AUXPROP_AUTHZID: true > 3: I0123 19:22:05.934298 15987 auxprop.cpp:131] Skipping auxiliary property > '*userPassword' since SASL_AUXPROP_AUTHZID == true > 3: I0123 19:22:05.934307 15987 auxprop.cpp:131] Skipping auxiliary property > '*cmusaslsecretCRAM-MD5' since SASL_AUXPROP_AUTHZID == true > 3: I0123 19:22:05.934324 15987 authenticator.cpp:318] Authentication success > 3: I0123 19:22:05.934463 15995 authenticatee.cpp:299] Authentication success > 3: I0123 19:22:05.934563 16002 master.cpp:8988] Successfully authenticated > principal 'test-principal' at slave(442)@172.17.0.2:45634 > 3: I0123 19:22:05.934708 15993 authenticator.cpp:432] Authentication session > cleanup for crammd5-authenticatee(870)@172.17.0.2:45634 > 3: I0123 19:22:05.934891 15995 slave.cpp:1320] Successfully authenticated > with master master@172.17.0.2:45634 > 3: I0123 19:22:05.935261 15995 slave.cpp:1764] Will retry registration in > 2.234083ms if necessary > 3: I0123 19:22:05.935436 15999 master.cpp:6061] Received register agent > message from slave(442)@172.17.0.2:45634 (455912973e2c) > 3: I0123 19:22:05.935662 15999 master.cpp:3867] Authorizing agent with > principal 'test-principal' > 3: I0123 19:22:05.936161 15992 master.cpp:6123] Authorized registration of > agent at slave(442)@172.17.0.2:45634 (455912973e2c) > 3: I0123 19:22:05.936261 15992 master.cpp:6234] Registering agent at > slave(442)@172.17.0.2:45634 (455912973e2c) with id > 
eef8ea11-9247-44f3-84cf-340b24df3a52-S0 > 3: I0123 19:22:05.936993 15989 registrar.cpp:495] Applied 1 operations in > 227911ns; attempting to update the registry > 3: I0123 19:22:05.937814 15989 registrar.cpp:552] Successfully updated the > registry in 743168ns > 3: I0123 19:22:05.938057 15991 master.cpp:6282] Admitted agent > eef8ea11-9247-44f3-84cf-340b24df3a52-S0 at slave(442)@172.17.0.2:45634 > (455912973e2c) > 3: I0123 19:22:05.938891 15991 master.cpp:6331] Registered agent > eef8ea11-9247-44f3-84cf-340b24df3a52-S0 at slave(442)@172.17.0.2:45634 > (455912973e2c) with cpus:2; mem:1024; disk:1024; ports:[31000-32000] > 3: I0123 19:22:05.939159 16002 slave.cpp:1764] Will retry registration in > 26.332876ms if necessary
[jira] [Commented] (MESOS-6985) os::getenv() can segfault
[ https://issues.apache.org/jira/browse/MESOS-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338076#comment-16338076 ] Ilya Pronin commented on MESOS-6985: [~vinodkone], sorry I missed the comment somehow. I have a POC-like patch for this, didn't have time to finish it. I'll try to finish it maybe next week. Feel free to reassign if somebody would like to work on it before that. > os::getenv() can segfault > - > > Key: MESOS-6985 > URL: https://issues.apache.org/jira/browse/MESOS-6985 > Project: Mesos > Issue Type: Bug > Components: stout > Environment: ASF CI, Ubuntu 14.04 and CentOS 7 both with and without > libevent/SSL >Reporter: Greg Mann >Assignee: Ilya Pronin >Priority: Major > Labels: flaky-test, reliability, stout > Attachments: > MasterMaintenanceTest.InverseOffersFilters-truncated.txt, > MasterTest.MultipleExecutors.txt > > > This was observed on ASF CI. The segfault first showed up on CI on 9/20/16 > and has been produced by the tests {{MasterTest.MultipleExecutors}} and > {{MasterMaintenanceTest.InverseOffersFilters}}. 
In both cases, > {{os::getenv()}} segfaults with the same stack trace: > {code} > *** Aborted at 1485241617 (unix time) try "date -d @1485241617" if you are > using GNU date *** > PC: @ 0x2ad59e3ae82d (unknown) > I0124 07:06:57.422080 28619 exec.cpp:162] Version: 1.2.0 > *** SIGSEGV (@0xf0) received by PID 28591 (TID 0x2ad5a7b87700) from PID 240; > stack trace: *** > I0124 07:06:57.422336 28615 exec.cpp:212] Executor started at: > executor(75)@172.17.0.2:45752 with pid 28591 > @ 0x2ad5ab953197 (unknown) > @ 0x2ad5ab957479 (unknown) > @ 0x2ad59e165330 (unknown) > @ 0x2ad59e3ae82d (unknown) > @ 0x2ad594631358 os::getenv() > @ 0x2ad59aba6acf mesos::internal::slave::executorEnvironment() > @ 0x2ad59ab845c0 mesos::internal::slave::Framework::launchExecutor() > @ 0x2ad59ab818a2 mesos::internal::slave::Slave::_run() > @ 0x2ad59ac1ec10 > _ZZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS_6FutureIbEERKNS1_13FrameworkInfoERKNS1_12ExecutorInfoERK6OptionINS1_8TaskInfoEERKSF_INS1_13TaskGroupInfoEES6_S9_SC_SH_SL_EEvRKNS_3PIDIT_EEMSP_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_ENKUlPNS_11ProcessBaseEE_clES16_ > @ 0x2ad59ac1e6bf > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal5slave5SlaveERKNS0_6FutureIbEERKNS5_13FrameworkInfoERKNS5_12ExecutorInfoERK6OptionINS5_8TaskInfoEERKSJ_INS5_13TaskGroupInfoEESA_SD_SG_SL_SP_EEvRKNS0_3PIDIT_EEMST_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x2ad59bce2304 std::function<>::operator()() > @ 0x2ad59bcc9824 process::ProcessBase::visit() > @ 0x2ad59bd4028e process::DispatchEvent::visit() > @ 0x2ad594616df1 process::ProcessBase::serve() > @ 0x2ad59bcc72b7 process::ProcessManager::resume() > @ 0x2ad59bcd567c > process::ProcessManager::init_threads()::$_2::operator()() > @ 0x2ad59bcd5585 > _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvE3$_2vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE > @ 0x2ad59bcd std::_Bind_simple<>::operator()() > @ 0x2ad59bcd552c 
std::thread::_Impl<>::_M_run() > @ 0x2ad59d9e6a60 (unknown) > @ 0x2ad59e15d184 start_thread > @ 0x2ad59e46d37d (unknown) > make[4]: *** [check-local] Segmentation fault > {code} > Find attached the full log from a failed run of > {{MasterTest.MultipleExecutors}} and a truncated log from a failed run of > {{MasterMaintenanceTest.InverseOffersFilters}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8485) MasterTest.RegistryGcByCount is flaky
Vinod Kone created MESOS-8485: - Summary: MasterTest.RegistryGcByCount is flaky Key: MESOS-8485 URL: https://issues.apache.org/jira/browse/MESOS-8485 Project: Mesos Issue Type: Bug Components: test Affects Versions: 1.5.0 Reporter: Vinod Kone Observed this while testing Mesos 1.5.0-rc1 in ASF CI. {code} 3: [ RUN ] MasterTest.RegistryGcByCount ..snip... 3: I0123 19:22:05.929347 15994 slave.cpp:1201] Detecting new master 3: I0123 19:22:05.931701 15988 slave.cpp:1228] Authenticating with master master@172.17.0.2:45634 3: I0123 19:22:05.931838 15988 slave.cpp:1237] Using default CRAM-MD5 authenticatee 3: I0123 19:22:05.932153 15999 authenticatee.cpp:121] Creating new client SASL connection 3: I0123 19:22:05.932580 15992 master.cpp:8958] Authenticating slave(442)@172.17.0.2:45634 3: I0123 19:22:05.932822 15990 authenticator.cpp:414] Starting authentication session for crammd5-authenticatee(870)@172.17.0.2:45634 3: I0123 19:22:05.933163 15989 authenticator.cpp:98] Creating new server SASL connection 3: I0123 19:22:05.933465 16001 authenticatee.cpp:213] Received SASL authentication mechanisms: CRAM-MD5 3: I0123 19:22:05.933495 16001 authenticatee.cpp:239] Attempting to authenticate with mechanism 'CRAM-MD5' 3: I0123 19:22:05.933631 15987 authenticator.cpp:204] Received SASL authentication start 3: I0123 19:22:05.933712 15987 authenticator.cpp:326] Authentication requires more steps 3: I0123 19:22:05.933851 15987 authenticatee.cpp:259] Received SASL authentication step 3: I0123 19:22:05.934006 15987 authenticator.cpp:232] Received SASL authentication step 3: I0123 19:22:05.934041 15987 auxprop.cpp:109] Request to lookup properties for user: 'test-principal' realm: '455912973e2c' server FQDN: '455912973e2c' SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false SASL_AUXPROP_AUTHZID: false 3: I0123 19:22:05.934095 15987 auxprop.cpp:181] Looking up auxiliary property '*userPassword' 3: I0123 19:22:05.934147 15987 auxprop.cpp:181] Looking up auxiliary property 
'*cmusaslsecretCRAM-MD5' 3: I0123 19:22:05.934279 15987 auxprop.cpp:109] Request to lookup properties for user: 'test-principal' realm: '455912973e2c' server FQDN: '455912973e2c' SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false SASL_AUXPROP_AUTHZID: true 3: I0123 19:22:05.934298 15987 auxprop.cpp:131] Skipping auxiliary property '*userPassword' since SASL_AUXPROP_AUTHZID == true 3: I0123 19:22:05.934307 15987 auxprop.cpp:131] Skipping auxiliary property '*cmusaslsecretCRAM-MD5' since SASL_AUXPROP_AUTHZID == true 3: I0123 19:22:05.934324 15987 authenticator.cpp:318] Authentication success 3: I0123 19:22:05.934463 15995 authenticatee.cpp:299] Authentication success 3: I0123 19:22:05.934563 16002 master.cpp:8988] Successfully authenticated principal 'test-principal' at slave(442)@172.17.0.2:45634 3: I0123 19:22:05.934708 15993 authenticator.cpp:432] Authentication session cleanup for crammd5-authenticatee(870)@172.17.0.2:45634 3: I0123 19:22:05.934891 15995 slave.cpp:1320] Successfully authenticated with master master@172.17.0.2:45634 3: I0123 19:22:05.935261 15995 slave.cpp:1764] Will retry registration in 2.234083ms if necessary 3: I0123 19:22:05.935436 15999 master.cpp:6061] Received register agent message from slave(442)@172.17.0.2:45634 (455912973e2c) 3: I0123 19:22:05.935662 15999 master.cpp:3867] Authorizing agent with principal 'test-principal' 3: I0123 19:22:05.936161 15992 master.cpp:6123] Authorized registration of agent at slave(442)@172.17.0.2:45634 (455912973e2c) 3: I0123 19:22:05.936261 15992 master.cpp:6234] Registering agent at slave(442)@172.17.0.2:45634 (455912973e2c) with id eef8ea11-9247-44f3-84cf-340b24df3a52-S0 3: I0123 19:22:05.936993 15989 registrar.cpp:495] Applied 1 operations in 227911ns; attempting to update the registry 3: I0123 19:22:05.937814 15989 registrar.cpp:552] Successfully updated the registry in 743168ns 3: I0123 19:22:05.938057 15991 master.cpp:6282] Admitted agent eef8ea11-9247-44f3-84cf-340b24df3a52-S0 at 
slave(442)@172.17.0.2:45634 (455912973e2c) 3: I0123 19:22:05.938891 15991 master.cpp:6331] Registered agent eef8ea11-9247-44f3-84cf-340b24df3a52-S0 at slave(442)@172.17.0.2:45634 (455912973e2c) with cpus:2; mem:1024; disk:1024; ports:[31000-32000] 3: I0123 19:22:05.939159 16002 slave.cpp:1764] Will retry registration in 26.332876ms if necessary 3: I0123 19:22:05.939349 15994 master.cpp:6061] Received register agent message from slave(442)@172.17.0.2:45634 (455912973e2c) 3: I0123 19:22:05.939347 15998 hierarchical.cpp:574] Added agent eef8ea11-9247-44f3-84cf-340b24df3a52-S0 (455912973e2c) with cpus:2; mem:1024; disk:1024; ports:[31000-32000] (allocated: {}) 3: I0123 19:22:05.939574 15994 master.cpp:3867] Authorizing agent with
[jira] [Commented] (MESOS-8484) stout test NumifyTest.HexNumberTest fails.
[ https://issues.apache.org/jira/browse/MESOS-8484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338072#comment-16338072 ] Till Toenshoff commented on MESOS-8484: --- commits tried: - current head (cd2774efde5e55cc027721086af14fbc78688849) -> fails - e91ce42ed56c5ab65220fbba740a8a50c7f835ae -> works > stout test NumifyTest.HexNumberTest fails. > --- > > Key: MESOS-8484 > URL: https://issues.apache.org/jira/browse/MESOS-8484 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.6.0 > Environment: macOS 10.13.2 (17C88) > Apple LLVM version 9.0.0 (clang-900.0.37) > ../configure && make check -j6 >Reporter: Till Toenshoff >Priority: Blocker > > The current Mesos master shows the following on my machine: > {noformat} > [ RUN ] NumifyTest.HexNumberTest > ../../../3rdparty/stout/tests/numify_tests.cpp:57: Failure > Value of: numify("0x10.9").isError() > Actual: false > Expected: true > ../../../3rdparty/stout/tests/numify_tests.cpp:58: Failure > Value of: numify("0x1p-5").isError() > Actual: false > Expected: true > [ FAILED ] NumifyTest.HexNumberTest (0 ms) > {noformat} > This problem disappears for me when reverting the latest boost upgrade. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-8484) stout test NumifyTest.HexNumberTest fails.
[ https://issues.apache.org/jira/browse/MESOS-8484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Toenshoff updated MESOS-8484: -- Environment: macOS 10.13.2 (17C88) Apple LLVM version 9.0.0 (clang-900.0.37) ../configure && make check -j6 > stout test NumifyTest.HexNumberTest fails. > --- > > Key: MESOS-8484 > URL: https://issues.apache.org/jira/browse/MESOS-8484 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.6.0 > Environment: macOS 10.13.2 (17C88) > Apple LLVM version 9.0.0 (clang-900.0.37) > ../configure && make check -j6 >Reporter: Till Toenshoff >Priority: Blocker > > The current Mesos master shows the following on my machine: > {noformat} > [ RUN ] NumifyTest.HexNumberTest > ../../../3rdparty/stout/tests/numify_tests.cpp:57: Failure > Value of: numify("0x10.9").isError() > Actual: false > Expected: true > ../../../3rdparty/stout/tests/numify_tests.cpp:58: Failure > Value of: numify("0x1p-5").isError() > Actual: false > Expected: true > [ FAILED ] NumifyTest.HexNumberTest (0 ms) > {noformat} > This problem disappears for me when reverting the latest boost upgrade. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8483) ExampleTests PythonFramework fails with sigabort.
[ https://issues.apache.org/jira/browse/MESOS-8483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338061#comment-16338061 ] Benjamin Bannier commented on MESOS-8483: - This fails for me in a different way, {noformat} % ./examples/python/test-framework local [libprotobuf ERROR google/protobuf/descriptor_database.cc:58] File already exists in database: mesos/mesos.proto [libprotobuf FATAL google/protobuf/descriptor.cc:1394] CHECK failed: generated_database_->Add(encoded_file_descriptor, size): libc++abi.dylib: terminating with uncaught exception of type google::protobuf::FatalException: CHECK failed: generated_database_->Add(encoded_file_descriptor, size): [1]57083 abort ./examples/python/test-framework local {noformat} I am using python-2.7.14 and unbundled protobuf-3.5.1 from homebrew. > ExampleTests PythonFramework fails with sigabort. > - > > Key: MESOS-8483 > URL: https://issues.apache.org/jira/browse/MESOS-8483 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.5.0 > Environment: macOS 10.13.2 (17C88) > Python 2.7.10 (Apple's default - not homebrew) >Reporter: Till Toenshoff >Priority: Blocker > > Starting the {{PythonFramework}} manually results in a sigabort: > {noformat} > $ ./src/examples/python/test-framework local > [..] 
> I0124 15:22:46.637238 65925120 master.cpp:563] Using default 'crammd5' > authenticator > W0124 15:22:46.637269 65925120 authenticator.cpp:513] No credentials > provided, authentication requests will be refused > I0124 15:22:46.637284 65925120 authenticator.cpp:520] Initializing server SASL > I0124 15:22:46.659503 2385417024 resolver.cpp:69] Creating default secret > resolver > I0124 15:22:46.659624 2385417024 containerizer.cpp:304] Using isolation { > environment_secret, filesystem/posix, posix/mem, posix/cpu } > I0124 15:22:46.659951 2385417024 provisioner.cpp:299] Using default backend > 'copy' > I0124 15:22:46.661628 67534848 slave.cpp:262] Mesos agent started on > (1)@192.168.178.20:49682 > I0124 15:22:46.661669 67534848 slave.cpp:263] Flags at startup: > --appc_simple_discovery_uri_prefix="http://" > --appc_store_dir="/var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/mesos/store/appc" > --authenticate_http_executors="false" --authenticate_http_readonly="false" > --authenticate_http_readwrite="false" --authenticatee="crammd5" > --authentication_backoff_factor="1secs" --authorizer="local" > --container_disk_watch_interval="15secs" --containerizers="mesos" > --default_role="*" --disk_watch_interval="1mins" --docker="docker" > --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io" > --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" > --docker_stop_timeout="0ns" > --docker_store_dir="/var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/mesos/store/docker" > --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" > --enforce_container_disk_quota="false" > --executor_registration_timeout="1mins" > --executor_reregistration_timeout="2secs" > --executor_shutdown_grace_period="5secs" > --fetcher_cache_dir="/var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/mesos/work/agents/0/fetch" > --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" > --gc_disk_headroom="0.1" --hadoop_home="" --help="false" > 
--hostname_lookup="true" --http_command_executor="false" > --http_heartbeat_interval="30secs" --initialize_driver_logging="true" > --isolation="posix/cpu,posix/mem" --launcher="posix" > --launcher_dir="/usr/local/libexec/mesos" --logbufsecs="0" > --logging_level="INFO" --max_completed_executors_per_framework="150" > --oversubscribed_resources_interval="15secs" --port="5051" > --qos_correction_interval_min="0ns" --quiet="false" > --reconfiguration_policy="equal" --recover="reconnect" > --recovery_timeout="15mins" --registration_backoff_factor="1secs" > --runtime_dir="/var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/mesos/work/agents/0/run" > --sandbox_directory="/mnt/mesos/sandbox" --strict="true" > --switch_user="true" --version="false" > --work_dir="/var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/mesos/work/agents/0/work" > --zk_session_timeout="10secs" > python(1780,0x74068000) malloc: *** error for object 0x106ac07c8: pointer > being freed was not allocated > *** set a breakpoint in malloc_error_break to debug > {noformat} > When running the {{PythonFramework}} via lldb, I get the following stacktrace: > {noformat} > * thread #7, stop reason = signal SIGABRT > * frame #0: 0x7fff55321e3e libsystem_kernel.dylib`__pthread_kill + 10 > frame #1: 0x7fff55460150 libsystem_pthread.dylib`pthread_kill + 333 > frame #2: 0x7fff5527e312 libsystem_c.dylib`abort + 127 > frame #3: 0x7fff5537b866 libsystem_malloc.dylib`free + 521 >
[jira] [Updated] (MESOS-8481) Agent reboot during checkpointing may result in empty checkpoints.
[ https://issues.apache.org/jira/browse/MESOS-8481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Park updated MESOS-8481: Priority: Blocker (was: Major) Target Version/s: 1.5.0 > Agent reboot during checkpointing may result in empty checkpoints. > -- > > Key: MESOS-8481 > URL: https://issues.apache.org/jira/browse/MESOS-8481 > Project: Mesos > Issue Type: Bug >Reporter: Chun-Hung Hsiao >Assignee: Michael Park >Priority: Blocker > > An empty checkpoint file was created due to the following incident. > At 17:12:25, the master assigned a task to an agent: > {noformat} > I0123 17:12:25.00 18618 master.cpp:11457] Adding task 5602 with resources > cpus(allocated: *):0.1; mem(allocated: *):128 on agent > aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4 at slave(1)@:5051 > () > I0123 17:12:25.00 18618 master.cpp:5017] Launching task 5602 of framework > 6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112 (Balloon Framework OOM) at > scheduler-fbba22f7-ebbc-4864-8394-0aa558f8ffaa@:10015 with resources > [...] on agent aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4 at > slave(1)@:5051 () > {noformat} > Meanwhile, the agent is being rebooted: > {noformat} > $ last reboot > reboot system boot 3.10.0-693.11.6. 
Tue Jan 23 17:14 - 00:09 (06:55) > {noformat} > The agent log did not show any information about the task, possibly because > there was no fsync before reboot: > {noformat} > I0123 17:12:09.00 17237 http.cpp:851] Authorizing principal > 'dcos_checks_agent' to GET the endpoint '/metrics/snapshot' > -- Reboot -- > I0123 17:15:40.00 2689 logsink.cpp:89] Added FileSink for glog logs to: > /var/log/mesos/mesos-agent.log > {noformat} > However, the agent was checkpointing the task before reboot: > {noformat} > $ sudo stat > /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/ > File: > ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/’ > Size: 39Blocks: 0 IO Block: 4096 directory > Device: ca40h/51776d Inode: 67306254Links: 3 > Access: (0755/drwxr-xr-x) Uid: (0/root) Gid: (0/root) > Context: system_u:object_r:unlabeled_t:s0 > Access: 2018-01-24 00:23:43.237322609 + > Modify: 2018-01-23 17:12:25.751463030 + > Change: 2018-01-23 17:12:25.751463030 + > Birth: - > {noformat} > And since there was no fsync before reboot, all checkpoints resulted in empty > files: > {noformat} > $ sudo stat > /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.info > File: > ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.info’ > Size: 0 Blocks: 0 IO Block: 4096 regular empty file > Device: ca40h/51776dInode: 33967500Links: 1 > Access: (0600/-rw---) Uid: (0/root) Gid: (0/root) > Context: system_u:object_r:unlabeled_t:s0 > Access: 2018-01-23 17:15:41.485506070 + > Modify: 2018-01-23 17:12:25.749463047 + > Change: 2018-01-23 17:12:25.749463047 + > Birth: - > $ sudo stat > 
/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.pid > File: > ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.pid’ > Size: 0 Blocks: 0 IO Block: 4096 regular empty file > Device: ca40h/51776dInode: 33967495Links: 1 > Access: (0600/-rw---) Uid: (0/root) Gid: (0/root) > Context: system_u:object_r:unlabeled_t:s0 > Access: 2018-01-23 23:00:42.190975780 + > Modify: 2018-01-23 17:12:25.749463047 + > Change: 2018-01-23 17:12:25.749463047 + > Birth: - > $ sudo stat > /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/executor.info > File: > ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/executor.info’ > Size: 0 Blocks: 0 IO Block: 4096 regular empty file > Device: ca40h/51776d Inode: 67306255Links: 1 > Access: (0600/-rw---) Uid: (0/root) Gid: (0/root) > Context: system_u:object_r:unlabeled_t:s0 > Access: 2018-01-23 17:12:25.751463030 + > Modify: 2018-01-23 17:12:25.751463030 + > Change: 2018-01-23 17:12:25.751463030 + > Birth: - > {noformat} > So were
[jira] [Updated] (MESOS-8484) stout test NumifyTest.HexNumberTest fails.
[ https://issues.apache.org/jira/browse/MESOS-8484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-8484: Affects Version/s: 1.6.0 > stout test NumifyTest.HexNumberTest fails. > --- > > Key: MESOS-8484 > URL: https://issues.apache.org/jira/browse/MESOS-8484 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Till Toenshoff >Priority: Blocker > > The current Mesos master shows the following on my machine: > {noformat} > [ RUN ] NumifyTest.HexNumberTest > ../../../3rdparty/stout/tests/numify_tests.cpp:57: Failure > Value of: numify("0x10.9").isError() > Actual: false > Expected: true > ../../../3rdparty/stout/tests/numify_tests.cpp:58: Failure > Value of: numify("0x1p-5").isError() > Actual: false > Expected: true > [ FAILED ] NumifyTest.HexNumberTest (0 ms) > {noformat} > This problem disappears for me when reverting the latest boost upgrade. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8484) stout test NumifyTest.HexNumberTest fails.
Till Toenshoff created MESOS-8484: - Summary: stout test NumifyTest.HexNumberTest fails. Key: MESOS-8484 URL: https://issues.apache.org/jira/browse/MESOS-8484 Project: Mesos Issue Type: Bug Reporter: Till Toenshoff The current Mesos master shows the following on my machine: {noformat} [ RUN ] NumifyTest.HexNumberTest ../../../3rdparty/stout/tests/numify_tests.cpp:57: Failure Value of: numify("0x10.9").isError() Actual: false Expected: true ../../../3rdparty/stout/tests/numify_tests.cpp:58: Failure Value of: numify("0x1p-5").isError() Actual: false Expected: true [ FAILED ] NumifyTest.HexNumberTest (0 ms) {noformat} This problem disappears for me when reverting the latest boost upgrade. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-3160) CgroupsAnyHierarchyMemoryPressureTest.ROOT_IncreaseRSS Flaky
[ https://issues.apache.org/jira/browse/MESOS-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-3160: - Story Points: 3 Sprint: Mesosphere Sprint 73 > CgroupsAnyHierarchyMemoryPressureTest.ROOT_IncreaseRSS Flaky > > > Key: MESOS-3160 > URL: https://issues.apache.org/jira/browse/MESOS-3160 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.24.0, 0.26.0 > Environment: Ubuntu 14.04 > CentOS 7 >Reporter: Paul Brett >Assignee: Greg Mann >Priority: Major > Labels: cgroups, flaky-test, mesosphere > > Test will occasionally with: > [ RUN ] CgroupsAnyHierarchyMemoryPressureTest.ROOT_IncreaseUnlockedRSS > ../../src/tests/containerizer/cgroups_tests.cpp:1103: Failure > helper.increaseRSS(getpagesize()): Failed to sync with the subprocess > ../../src/tests/containerizer/cgroups_tests.cpp:1103: Failure > helper.increaseRSS(getpagesize()): The subprocess has not been spawned yet > [ FAILED ] CgroupsAnyHierarchyMemoryPressureTest.ROOT_IncreaseUnlockedRSS > (223 ms) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-6985) os::getenv() can segfault
[ https://issues.apache.org/jira/browse/MESOS-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16337966#comment-16337966 ] Vinod Kone commented on MESOS-6985: --- Should we re-assign this to someone else [~ipronin]? > os::getenv() can segfault > - > > Key: MESOS-6985 > URL: https://issues.apache.org/jira/browse/MESOS-6985 > Project: Mesos > Issue Type: Bug > Components: stout > Environment: ASF CI, Ubuntu 14.04 and CentOS 7 both with and without > libevent/SSL >Reporter: Greg Mann >Assignee: Ilya Pronin >Priority: Major > Labels: flaky-test, reliability, stout > Attachments: > MasterMaintenanceTest.InverseOffersFilters-truncated.txt, > MasterTest.MultipleExecutors.txt > > > This was observed on ASF CI. The segfault first showed up on CI on 9/20/16 > and has been produced by the tests {{MasterTest.MultipleExecutors}} and > {{MasterMaintenanceTest.InverseOffersFilters}}. In both cases, > {{os::getenv()}} segfaults with the same stack trace: > {code} > *** Aborted at 1485241617 (unix time) try "date -d @1485241617" if you are > using GNU date *** > PC: @ 0x2ad59e3ae82d (unknown) > I0124 07:06:57.422080 28619 exec.cpp:162] Version: 1.2.0 > *** SIGSEGV (@0xf0) received by PID 28591 (TID 0x2ad5a7b87700) from PID 240; > stack trace: *** > I0124 07:06:57.422336 28615 exec.cpp:212] Executor started at: > executor(75)@172.17.0.2:45752 with pid 28591 > @ 0x2ad5ab953197 (unknown) > @ 0x2ad5ab957479 (unknown) > @ 0x2ad59e165330 (unknown) > @ 0x2ad59e3ae82d (unknown) > @ 0x2ad594631358 os::getenv() > @ 0x2ad59aba6acf mesos::internal::slave::executorEnvironment() > @ 0x2ad59ab845c0 mesos::internal::slave::Framework::launchExecutor() > @ 0x2ad59ab818a2 mesos::internal::slave::Slave::_run() > @ 0x2ad59ac1ec10 > 
_ZZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS_6FutureIbEERKNS1_13FrameworkInfoERKNS1_12ExecutorInfoERK6OptionINS1_8TaskInfoEERKSF_INS1_13TaskGroupInfoEES6_S9_SC_SH_SL_EEvRKNS_3PIDIT_EEMSP_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_ENKUlPNS_11ProcessBaseEE_clES16_ > @ 0x2ad59ac1e6bf > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal5slave5SlaveERKNS0_6FutureIbEERKNS5_13FrameworkInfoERKNS5_12ExecutorInfoERK6OptionINS5_8TaskInfoEERKSJ_INS5_13TaskGroupInfoEESA_SD_SG_SL_SP_EEvRKNS0_3PIDIT_EEMST_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x2ad59bce2304 std::function<>::operator()() > @ 0x2ad59bcc9824 process::ProcessBase::visit() > @ 0x2ad59bd4028e process::DispatchEvent::visit() > @ 0x2ad594616df1 process::ProcessBase::serve() > @ 0x2ad59bcc72b7 process::ProcessManager::resume() > @ 0x2ad59bcd567c > process::ProcessManager::init_threads()::$_2::operator()() > @ 0x2ad59bcd5585 > _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvE3$_2vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE > @ 0x2ad59bcd std::_Bind_simple<>::operator()() > @ 0x2ad59bcd552c std::thread::_Impl<>::_M_run() > @ 0x2ad59d9e6a60 (unknown) > @ 0x2ad59e15d184 start_thread > @ 0x2ad59e46d37d (unknown) > make[4]: *** [check-local] Segmentation fault > {code} > Find attached the full log from a failed run of > {{MasterTest.MultipleExecutors}} and a truncated log from a failed run of > {{MasterMaintenanceTest.InverseOffersFilters}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-8434) Cleanup Authorization logic in master and agent
[ https://issues.apache.org/jira/browse/MESOS-8434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rojas reassigned MESOS-8434: -- Assignee: Alexander Rojas > Cleanup Authorization logic in master and agent > --- > > Key: MESOS-8434 > URL: https://issues.apache.org/jira/browse/MESOS-8434 > Project: Mesos > Issue Type: Improvement > Components: agent, master >Affects Versions: 1.4.1 >Reporter: Alexander Rojas >Assignee: Alexander Rojas >Priority: Major > Labels: mesosphere, security > > During MesosCon EU 2017, [~benjaminhindman] came up with a neat abstraction > called [{{ObjectApprovers}}|https://reviews.apache.org/r/63258/] which go a > long way into streamlining and unifying the authorization used within mesos. > However, these patches became stale afterwards. > Given the benefits of such logic, we should really make the effort to land > these patches. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-8483) ExampleTests PythonFramework fails with sigabort.
[ https://issues.apache.org/jira/browse/MESOS-8483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Toenshoff updated MESOS-8483: -- Affects Version/s: 1.5.0 > ExampleTests PythonFramework fails with sigabort. > - > > Key: MESOS-8483 > URL: https://issues.apache.org/jira/browse/MESOS-8483 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.5.0 > Environment: macOS 10.13.2 (17C88) > Python 2.7.10 (Apple's default - not homebrew) >Reporter: Till Toenshoff >Priority: Blocker > > Starting the {{PythonFramework}} manually results in a sigabort: > {noformat} > $ ./src/examples/python/test-framework local > [..] > I0124 15:22:46.637238 65925120 master.cpp:563] Using default 'crammd5' > authenticator > W0124 15:22:46.637269 65925120 authenticator.cpp:513] No credentials > provided, authentication requests will be refused > I0124 15:22:46.637284 65925120 authenticator.cpp:520] Initializing server SASL > I0124 15:22:46.659503 2385417024 resolver.cpp:69] Creating default secret > resolver > I0124 15:22:46.659624 2385417024 containerizer.cpp:304] Using isolation { > environment_secret, filesystem/posix, posix/mem, posix/cpu } > I0124 15:22:46.659951 2385417024 provisioner.cpp:299] Using default backend > 'copy' > I0124 15:22:46.661628 67534848 slave.cpp:262] Mesos agent started on > (1)@192.168.178.20:49682 > I0124 15:22:46.661669 67534848 slave.cpp:263] Flags at startup: > --appc_simple_discovery_uri_prefix="http://" > --appc_store_dir="/var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/mesos/store/appc" > --authenticate_http_executors="false" --authenticate_http_readonly="false" > --authenticate_http_readwrite="false" --authenticatee="crammd5" > --authentication_backoff_factor="1secs" --authorizer="local" > --container_disk_watch_interval="15secs" --containerizers="mesos" > --default_role="*" --disk_watch_interval="1mins" --docker="docker" > --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io" > --docker_remove_delay="6hrs" 
--docker_socket="/var/run/docker.sock" > --docker_stop_timeout="0ns" > --docker_store_dir="/var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/mesos/store/docker" > --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" > --enforce_container_disk_quota="false" > --executor_registration_timeout="1mins" > --executor_reregistration_timeout="2secs" > --executor_shutdown_grace_period="5secs" > --fetcher_cache_dir="/var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/mesos/work/agents/0/fetch" > --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" > --gc_disk_headroom="0.1" --hadoop_home="" --help="false" > --hostname_lookup="true" --http_command_executor="false" > --http_heartbeat_interval="30secs" --initialize_driver_logging="true" > --isolation="posix/cpu,posix/mem" --launcher="posix" > --launcher_dir="/usr/local/libexec/mesos" --logbufsecs="0" > --logging_level="INFO" --max_completed_executors_per_framework="150" > --oversubscribed_resources_interval="15secs" --port="5051" > --qos_correction_interval_min="0ns" --quiet="false" > --reconfiguration_policy="equal" --recover="reconnect" > --recovery_timeout="15mins" --registration_backoff_factor="1secs" > --runtime_dir="/var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/mesos/work/agents/0/run" > --sandbox_directory="/mnt/mesos/sandbox" --strict="true" > --switch_user="true" --version="false" > --work_dir="/var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/mesos/work/agents/0/work" > --zk_session_timeout="10secs" > python(1780,0x74068000) malloc: *** error for object 0x106ac07c8: pointer > being freed was not allocated > *** set a breakpoint in malloc_error_break to debug > {noformat} > When running the {{PythonFramework}} via lldb, I get the following stacktrace: > {noformat} > * thread #7, stop reason = signal SIGABRT > * frame #0: 0x7fff55321e3e libsystem_kernel.dylib`__pthread_kill + 10 > frame #1: 0x7fff55460150 libsystem_pthread.dylib`pthread_kill + 333 > frame #2: 0x7fff5527e312 libsystem_c.dylib`abort + 127 > 
frame #3: 0x7fff5537b866 libsystem_malloc.dylib`free + 521 > frame #4: 0x00010d24daac > _scheduler.so`google::protobuf::internal::ArenaStringPtr::DestroyNoArena(this=0x7ac355b0, > default_value="") at arenastring.h:264 > frame #5: 0x00010d2fe1aa > _scheduler.so`mesos::Resource::SharedDtor(this=0x7ac35580) at > mesos.pb.cc:31016 > frame #6: 0x00010d2fe063 > _scheduler.so`mesos::Resource::~Resource(this=0x7ac35580) at > mesos.pb.cc:31011 > frame #7: 0x00010d2fe485 > _scheduler.so`mesos::Resource::~Resource(this=0x7ac35580) at > mesos.pb.cc:31009 > frame #8: 0x00010b0257c7 > _scheduler.so`mesos::Resources::parse(name="cpus", value="8",
[jira] [Updated] (MESOS-8483) ExampleTests PythonFramework fails with sigabort.
[ https://issues.apache.org/jira/browse/MESOS-8483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Toenshoff updated MESOS-8483: -- Environment: macOS 10.13.2 (17C88) Python 2.7.10 (Apple's default - not homebrew) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8483) ExampleTests PythonFramework fails with sigabort.
Till Toenshoff created MESOS-8483: - Summary: ExampleTests PythonFramework fails with sigabort. Key: MESOS-8483 URL: https://issues.apache.org/jira/browse/MESOS-8483 Project: Mesos Issue Type: Bug Reporter: Till Toenshoff Starting the {{PythonFramework}} manually results in a sigabort: {noformat} $ ./src/examples/python/test-framework local [..] I0124 15:22:46.637238 65925120 master.cpp:563] Using default 'crammd5' authenticator W0124 15:22:46.637269 65925120 authenticator.cpp:513] No credentials provided, authentication requests will be refused I0124 15:22:46.637284 65925120 authenticator.cpp:520] Initializing server SASL I0124 15:22:46.659503 2385417024 resolver.cpp:69] Creating default secret resolver I0124 15:22:46.659624 2385417024 containerizer.cpp:304] Using isolation { environment_secret, filesystem/posix, posix/mem, posix/cpu } I0124 15:22:46.659951 2385417024 provisioner.cpp:299] Using default backend 'copy' I0124 15:22:46.661628 67534848 slave.cpp:262] Mesos agent started on (1)@192.168.178.20:49682 I0124 15:22:46.661669 67534848 slave.cpp:263] Flags at startup: --appc_simple_discovery_uri_prefix="http://" --appc_store_dir="/var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/mesos/store/appc" --authenticate_http_executors="false" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authorizer="local" --container_disk_watch_interval="15secs" --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/mesos/store/docker" --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" --enforce_container_disk_quota="false" --executor_registration_timeout="1mins" 
--executor_reregistration_timeout="2secs" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/mesos/work/agents/0/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname_lookup="true" --http_command_executor="false" --http_heartbeat_interval="30secs" --initialize_driver_logging="true" --isolation="posix/cpu,posix/mem" --launcher="posix" --launcher_dir="/usr/local/libexec/mesos" --logbufsecs="0" --logging_level="INFO" --max_completed_executors_per_framework="150" --oversubscribed_resources_interval="15secs" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --reconfiguration_policy="equal" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --runtime_dir="/var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/mesos/work/agents/0/run" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" --version="false" --work_dir="/var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/mesos/work/agents/0/work" --zk_session_timeout="10secs" python(1780,0x74068000) malloc: *** error for object 0x106ac07c8: pointer being freed was not allocated *** set a breakpoint in malloc_error_break to debug {noformat} When running the {{PythonFramework}} via lldb, I get the following stacktrace: {noformat} * thread #7, stop reason = signal SIGABRT * frame #0: 0x7fff55321e3e libsystem_kernel.dylib`__pthread_kill + 10 frame #1: 0x7fff55460150 libsystem_pthread.dylib`pthread_kill + 333 frame #2: 0x7fff5527e312 libsystem_c.dylib`abort + 127 frame #3: 0x7fff5537b866 libsystem_malloc.dylib`free + 521 frame #4: 0x00010d24daac _scheduler.so`google::protobuf::internal::ArenaStringPtr::DestroyNoArena(this=0x7ac355b0, default_value="") at arenastring.h:264 frame #5: 0x00010d2fe1aa _scheduler.so`mesos::Resource::SharedDtor(this=0x7ac35580) at mesos.pb.cc:31016 frame #6: 0x00010d2fe063 
_scheduler.so`mesos::Resource::~Resource(this=0x7ac35580) at mesos.pb.cc:31011 frame #7: 0x00010d2fe485 _scheduler.so`mesos::Resource::~Resource(this=0x7ac35580) at mesos.pb.cc:31009 frame #8: 0x00010b0257c7 _scheduler.so`mesos::Resources::parse(name="cpus", value="8", role="*") at resources.cpp:702 frame #9: 0x00010c7ae4c9 _scheduler.so`mesos::internal::slave::Containerizer::resources(flags=0x00010202bac0) at containerizer.cpp:118 frame #10: 0x00010c3a93e1 _scheduler.so`mesos::internal::slave::Slave::initialize(this=0x00010202ba00) at slave.cpp:472 frame #11: 0x00010c3d7cb2 _scheduler.so`virtual thunk to mesos::internal::slave::Slave::initialize(this=0x00010202ba00) at slave.cpp:0 frame #12: 0x00010e459c39
[jira] [Updated] (MESOS-8482) Signed/Unsigned comparisons in tests
[ https://issues.apache.org/jira/browse/MESOS-8482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-8482: --- Description: Many tests in Mesos currently have comparisons between signed and unsigned integers, e.g. {noformat} ASSERT_EQ(4, v1Response->read_file().size()); {noformat} or comparisons between values of different enums, e.g. TaskState and v1::TaskState: {noformat} ASSERT_EQ(TASK_STARTING, startingUpdate->status().state()); {noformat} Usually, the compiler would catch these and emit a warning, but these are currently silenced because gtest headers are included using the {{-isystem}} command line flag. was: Many tests in Mesos currently have comparisons between signed and unsigned integers, e.g. {noformat} ASSERT_EQ(4, v1Response->read_file().size()); {noformat} or comparisons between values of different enums, e.g. TaskState and v1::TaskState: {noformat} ASSERT_EQ(TASK_STARTING, startingUpdate->status().state()); {noformat} Usually, the compiler would catch these and emit a warning, but these are currently silenced because gtest headers are included using the `-isystem` command line flag. > Signed/Unsigned comparisons in tests > > > Key: MESOS-8482 > URL: https://issues.apache.org/jira/browse/MESOS-8482 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Priority: Major > Labels: mesosphere, newbie, tech-debt > > Many tests in Mesos currently have comparisons between signed and unsigned > integers, e.g. > {noformat} > ASSERT_EQ(4, v1Response->read_file().size()); > {noformat} > or comparisons between values of different enums, e.g. TaskState and > v1::TaskState: > {noformat} > ASSERT_EQ(TASK_STARTING, startingUpdate->status().state()); > {noformat} > Usually, the compiler would catch these and emit a warning, but these are > currently silenced because gtest headers are included using the {{-isystem}} > command line flag. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (MESOS-8482) Signed/Unsigned comparisons in tests
[ https://issues.apache.org/jira/browse/MESOS-8482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-8482: --- Labels: mesosphere newbie tech-debt (was: ) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8482) Signed/Unsigned comparisons in tests
Benno Evers created MESOS-8482: -- Summary: Signed/Unsigned comparisons in tests Key: MESOS-8482 URL: https://issues.apache.org/jira/browse/MESOS-8482 Project: Mesos Issue Type: Bug Reporter: Benno Evers -- This message was sent by Atlassian JIRA (v7.6.3#76005)