[ 
https://issues.apache.org/jira/browse/MESOS-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149511#comment-16149511
 ] 

Andrew Schwartzmeyer commented on MESOS-7929:
---------------------------------------------

This bug is marked critical as it is blocking multiple developers from being 
able to run a full {{mesos-tests}} pass.

> `Metrics()` hangs on second call on Windows
> -------------------------------------------
>
>                 Key: MESOS-7929
>                 URL: https://issues.apache.org/jira/browse/MESOS-7929
>             Project: Mesos
>          Issue Type: Bug
>         Environment: Windows 10 Enterprise Build 15063 (and also confirmed on 
> 14393).
>            Reporter: Andrew Schwartzmeyer
>            Priority: Critical
>              Labels: windows
>
> An unfortunately difficult-to-debug problem has cropped up on Windows. When 
> running {{mesos-tests}}, the run will hang at:
> {noformat}
> [==========] Running 2 tests from 2 test cases.
> [----------] Global test environment set-up.
> [----------] 1 test from FetcherTest
> [ RUN      ] FetcherTest.MalformedURI
> [       OK ] FetcherTest.MalformedURI (48 ms)
> [----------] 1 test from FetcherTest (63 ms total)
> [----------] 1 test from GarbageCollectorTest
> [ RUN      ] GarbageCollectorTest.Schedule
> C:\Users\andschwa\src\mesos-master\src\tests\utils.cpp(64): error: Failed to 
> wait 15secs for response
> C:\Users\andschwa\src\mesos-master\src\tests\utils.cpp(65): error: Failed to 
> wait 15secs for response
> {noformat}
> {{GarbageCollectorTest.Schedule}} is the first test that will hang in an 
> unfiltered run of mesos-tests.
> This can be minimally reproduced by running any two tests which call 
> {{Metrics()}} from {{utils.cpp}}. The following have been confirmed:
> {noformat}
> --gtest_filter="GarbageCollectorTest.Schedule:HierarchicalAllocatorTest.OfferFilter"
> --gtest_filter="GarbageCollectorTest.Schedule:FetcherTest.MalformedURI"
> --gtest_filter="HierarchicalAllocatorTest.OfferFilter:FetcherTest.MalformedURI"
> {noformat}
> The second test will hang (indicating a race condition), waiting for a 
> {{GET}} to {{/metrics/snapshot}} that never returns.
> The bug also appears to be timing-sensitive. If your CPU is heavily 
> utilized (say, by running another build in the background), the tests will 
> pass. They will also pass if you attach Application Verifier to 
> {{mesos-tests.exe}}, which slows down execution enough to avoid the hang. 
> Very slow machines (such as those used for CI) also do not exhibit the hang.
> Oddly, the bug will reproduce under the Visual Studio debugger, but all it 
> shows us is a pending future waiting for the metrics request to come back.
> In {{metrics.cpp}} there is a note that the request might time out, but we're 
> unsure whether this is the same problem or a different problem manifesting 
> in the same way:
> {noformat}
>   // TODO(neilc): This request might timeout if the current value of a
>   // metric cannot be determined. In tests, a common cause for this is
>   // MESOS-6231 when multiple scheduler drivers are in use.
> {noformat}
> A {{git bisect}} revealed that:
> {noformat}
> 20c5311434e45a631ffc6036d327e00b2228ad26 is the first bad commit
> commit 20c5311434e45a631ffc6036d327e00b2228ad26
> Author: James Peach <[email protected]>
> Date:   Tue Aug 22 16:19:47 2017 -0700
>     Added agent garbage collection metrics.
>     Added some basic sandbox garbage collection metrics to track the number
>     of successful, failed and pending path removals.
>     Review: https://reviews.apache.org/r/61260/
> {noformat}
> This commit caused the bug to appear, which does not necessarily mean that 
> it introduced the bug. Reverting the commit allows all the tests to pass, 
> but we believe this just hides the bug.
> This bug has reproduced on Windows machines with and without Docker (and 
> Windows containers) installed. (I only mention this because it was a variable 
> on my machine when the bug first appeared; I have since ruled it out as 
> relevant.)
> We do not think that it is specific to {{libevent}}, as the bug does not 
> appear to reproduce on a Linux VM built with {{libevent}} instead of 
> {{libev}}.
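
The minimal reproduction described in the issue (running any two tests that call {{Metrics()}}) could be scripted roughly as follows. This is only a sketch under stated assumptions: the helper names {{pair_filters}} and {{run_pair}}, the binary path, and the 120-second timeout are hypothetical and not part of Mesos. The three test names and the colon-separated {{--gtest_filter}} syntax come from the issue itself.

```python
import itertools
import subprocess

# Test names taken from the issue description; all three call Metrics().
TESTS = [
    "GarbageCollectorTest.Schedule",
    "HierarchicalAllocatorTest.OfferFilter",
    "FetcherTest.MalformedURI",
]


def pair_filters(tests):
    """Return one --gtest_filter argument per unordered pair of tests."""
    return [
        "--gtest_filter={}:{}".format(a, b)
        for a, b in itertools.combinations(tests, 2)
    ]


def run_pair(binary, gtest_filter, timeout_secs=120):
    """Run one filtered pass (hypothetical helper).

    Returns True if the process exited within timeout_secs and False if it
    had to be killed, which is treated here as a hang. The timeout value is
    an assumption, not something specified in the issue.
    """
    try:
        subprocess.run([binary, gtest_filter], timeout=timeout_secs)
        return True
    except subprocess.TimeoutExpired:
        return False
```

Calling {{run_pair("mesos-tests.exe", f)}} for each {{f}} in {{pair_filters(TESTS)}} would exercise the same three combinations listed in the issue; the path to the binary depends on the local build directory.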



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
