[
https://issues.apache.org/jira/browse/MESOS-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149511#comment-16149511
]
Andrew Schwartzmeyer commented on MESOS-7929:
---------------------------------------------
This bug is marked critical as it is blocking multiple developers from being
able to run a full {{mesos-tests}} pass.
> `Metrics()` hangs on second call on Windows
> -------------------------------------------
>
> Key: MESOS-7929
> URL: https://issues.apache.org/jira/browse/MESOS-7929
> Project: Mesos
> Issue Type: Bug
> Environment: Windows 10 Enterprise Build 15063 (and also confirmed on
> 14393).
> Reporter: Andrew Schwartzmeyer
> Priority: Critical
> Labels: windows
>
> A difficult-to-debug problem has cropped up on Windows. When running
> {{mesos-tests}}, the run hangs at:
> {noformat}
> [==========] Running 2 tests from 2 test cases.
> [----------] Global test environment set-up.
> [----------] 1 test from FetcherTest
> [ RUN ] FetcherTest.MalformedURI
> [ OK ] FetcherTest.MalformedURI (48 ms)
> [----------] 1 test from FetcherTest (63 ms total)
> [----------] 1 test from GarbageCollectorTest
> [ RUN ] GarbageCollectorTest.Schedule
> C:\Users\andschwa\src\mesos-master\src\tests\utils.cpp(64): error: Failed to
> wait 15secs for response
> C:\Users\andschwa\src\mesos-master\src\tests\utils.cpp(65): error: Failed to
> wait 15secs for response
> {noformat}
> {{GarbageCollectorTest.Schedule}} is the first test that will hang in an
> unfiltered run of mesos-tests.
> This can be minimally reproduced by running any two tests which call
> {{Metrics()}} from {{utils.cpp}}. The following have been confirmed:
> {noformat}
> --gtest_filter="GarbageCollectorTest.Schedule:HierarchicalAllocatorTest.OfferFilter"
> --gtest_filter="GarbageCollectorTest.Schedule:FetcherTest.MalformedURI"
> --gtest_filter="HierarchicalAllocatorTest.OfferFilter:FetcherTest.MalformedURI"
> {noformat}
> The second test will hang (indicating a race condition), waiting for a
> {{GET}} to {{/metrics/snapshot}} that never returns.
> This bug also appears to be timing-dependent. If your CPU is heavily
> utilized (say, by running another build in the background), the tests
> will pass. They will also pass if you attach Application Verifier to
> {{mesos-tests.exe}}, which slows down execution enough to avoid the hang.
> Very slow machines (such as those used for CI) also do not exhibit this hang.
> Oddly, the bug will reproduce under the Visual Studio debugger, but all it
> shows us is a pending future waiting for the metrics request to come back.
> In {{metrics.cpp}} there is a note that the request might time out, but we're
> unsure if this is the same problem, or a different problem manifesting in the
> same way:
> {noformat}
> // TODO(neilc): This request might timeout if the current value of a
> // metric cannot be determined. In tests, a common cause for this is
> // MESOS-6231 when multiple scheduler drivers are in use.
> {noformat}
> A {{git bisect}} revealed that:
> {noformat}
> 20c5311434e45a631ffc6036d327e00b2228ad26 is the first bad commit
> commit 20c5311434e45a631ffc6036d327e00b2228ad26
> Author: James Peach <[email protected]>
> Date:   Tue Aug 22 16:19:47 2017 -0700
>
>     Added agent garbage collection metrics.
>
>     Added some basic sandbox garbage collection metrics to track the number
>     of successful, failed and pending path removals.
>
>     Review: https://reviews.apache.org/r/61260/
> This commit caused the bug to appear, which does not necessarily mean that it
> introduced the bug. Reverting the commit allows all the tests to pass, but we
> believe this just hides the bug.
> This bug has reproduced on Windows machines with and without Docker (and
> Windows containers) installed. (I only mention this because it was a variable
> on my machine when the bug first appeared, but I have since ruled it out as
> relevant.)
> We do not think that it is specific to {{libevent}}, as the bug does not
> appear to reproduce on a Linux VM built with {{libevent}} instead of
> {{libev}}.