[ 
https://issues.apache.org/jira/browse/MESOS-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034948#comment-15034948
 ] 

Joseph Wu commented on MESOS-3586:
----------------------------------

This race _almost_ seems unavoidable (at least, given how the test is currently 
written), and I don't think the sleep duration is really the problem.

*Background*
Both tests are essentially hammering away at memory, resulting in "memory 
pressure".  Depending on the pressure level (low, medium, critical), this 
triggers the corresponding cgroup memory pressure events.  By definition, the 
"low" pressure event is triggered whenever there is any pressure at all:
{quote}
Application will be notified through eventfd when memory pressure is at
the specific level (or higher).
{quote}
[Reference section "11. Memory 
Pressure"|https://www.kernel.org/doc/Documentation/cgroups/memory.txt]

In the tests, we check this by expecting "number of low pressure events" >= 
"number of medium pressure events" >= "number of critical pressure events".

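A minimal sketch of why that ordering should hold once every notification has been counted, assuming a listener at a given level fires for that level or higher (as the kernel doc quoted above says). {{PressureCounters}}, {{Level}}, {{record}} and {{ordered}} are hypothetical names for illustration, not Mesos APIs:

```cpp
#include <cstdint>

// Hypothetical stand-in for the three counters the test reads.
struct PressureCounters {
  uint64_t low = 0;
  uint64_t medium = 0;
  uint64_t critical = 0;
};

enum class Level { LOW, MEDIUM, CRITICAL };

// Records one pressure occurrence after ALL of its notifications have
// been processed: each listener fires for its own level or higher.
void record(PressureCounters& c, Level level) {
  c.low++;  // The "low" listener fires for any pressure at all.
  if (level == Level::MEDIUM || level == Level::CRITICAL) {
    c.medium++;
  }
  if (level == Level::CRITICAL) {
    c.critical++;
  }
}

// The expectation checked by the tests: low >= medium >= critical.
bool ordered(const PressureCounters& c) {
  return c.low >= c.medium && c.medium >= c.critical;
}
```

In this model the ordering can never be violated, because each occurrence updates all affected counters in one step. The real counters are updated by separate notifications, which is where the trouble starts.
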
*Problem*
There's no guarantee of the order of notification.  When we read from our 
memory pressure counters, there might be some events in-flight that haven't 
been processed yet.  Therefore, the counters occasionally violate the expected 
ordering (in the report below, medium = 5 vs critical = 6).

*Possible fix*
The memory pressure event counts should be eventually consistent with our 
expectations.  So the test should probably:
* Stop the memory-hammering task at some point.
* Wait for all pressure events to be processed.
* Then check the counters.
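
The steps above can be sketched with a toy model, assuming notifications sit in a queue until they are counted. {{Listener}}, {{processOne}} and {{drain}} are hypothetical names, and how the real test would know that nothing is still pending is left open here:

```cpp
#include <cstdint>
#include <deque>

enum class Level { LOW, MEDIUM, CRITICAL };

struct PressureCounters {
  uint64_t low = 0;
  uint64_t medium = 0;
  uint64_t critical = 0;
};

// Hypothetical model of the listener side: notifications that have
// fired but have not been added to the counters yet.
struct Listener {
  std::deque<Level> inFlight;
  PressureCounters counters;

  // Counts one pending notification; returns false when none are left.
  bool processOne() {
    if (inFlight.empty()) {
      return false;
    }
    Level level = inFlight.front();
    inFlight.pop_front();
    switch (level) {
      case Level::LOW:      counters.low++;      break;
      case Level::MEDIUM:   counters.medium++;   break;
      case Level::CRITICAL: counters.critical++; break;
    }
    return true;
  }

  // "Wait for all pressure events to be processed."
  void drain() {
    while (processOne()) {}
  }
};

// The expectation checked by the tests: low >= medium >= critical.
bool ordered(const PressureCounters& c) {
  return c.low >= c.medium && c.medium >= c.critical;
}
```

If one critical occurrence queues its three notifications as critical, medium, low, then reading the counters after a single {{processOne()}} sees critical = 1 with medium = 0, which is the kind of failure reported. After {{drain()}} the ordering holds again, provided no new events are being generated, hence stopping the task first.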

> MemoryPressureMesosTest.CGROUPS_ROOT_Statistics and 
> CGROUPS_ROOT_SlaveRecovery are flaky
> ----------------------------------------------------------------------------------------
>
>                 Key: MESOS-3586
>                 URL: https://issues.apache.org/jira/browse/MESOS-3586
>             Project: Mesos
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 0.24.0, 0.26.0
>         Environment: Ubuntu 14.04, 3.13.0-32 generic
> Debian 8, gcc 4.9.2
>            Reporter: Miguel Bernadin
>              Labels: flaky, flaky-test
>
> I am installing Mesos 0.24.0 on 4 servers which have very similar hardware and 
> software configurations. 
> After performing ../configure, make, and make check, some servers 
> completed successfully and others failed on test [ RUN      ] 
> MemoryPressureMesosTest.CGROUPS_ROOT_Statistics.
> Is there something I should check in this test? 
> PERFORMED MAKE CHECK NODE-001
> [ RUN      ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
> I1005 14:37:35.585067 38479 exec.cpp:133] Version: 0.24.0
> I1005 14:37:35.593789 38497 exec.cpp:207] Executor registered on slave 
> 20151005-143735-2393768202-35106-27900-S0
> Registered executor on svdidac038.techlabs.accenture.com
> Starting task 010b2fe9-4eac-4136-8a8a-6ce7665488b0
> Forked command at 38510
> sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done'
> PERFORMED MAKE CHECK NODE-002
> [ RUN      ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
> I1005 14:38:58.794112 36997 exec.cpp:133] Version: 0.24.0
> I1005 14:38:58.802851 37022 exec.cpp:207] Executor registered on slave 
> 20151005-143857-2360213770-50427-26325-S0
> Registered executor on svdidac039.techlabs.accenture.com
> Starting task 9bb317ba-41cb-44a4-b507-d1c85ceabc28
> sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done'
> Forked command at 37028
> ../../src/tests/containerizer/memory_pressure_tests.cpp:145: Failure
> Expected: (usage.get().mem_medium_pressure_counter()) >= 
> (usage.get().mem_critical_pressure_counter()), actual: 5 vs 6
> 2015-10-05 
> 14:39:00,130:26325(0x2af08cc78700):ZOO_ERROR@handle_socket_error_msg@1697: 
> Socket [127.0.0.1:37198] zk retcode=-4, errno=111(Connection refused): server 
> refused to accept the client
> [  FAILED  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (4303 ms)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
