Andrew Schwartzmeyer created MESOS-7929:
-------------------------------------------
Summary: `Metrics()` hangs on second call on Windows
Key: MESOS-7929
URL: https://issues.apache.org/jira/browse/MESOS-7929
Project: Mesos
Issue Type: Bug
Environment: Windows 10 Enterprise Build 15063 (also confirmed on 14393).
Reporter: Andrew Schwartzmeyer
Priority: Critical
An unfortunately difficult-to-debug problem has cropped up on Windows. When
running {{mesos-tests}}, the suite hangs at:
{noformat}
[==========] Running 2 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 1 test from FetcherTest
[ RUN ] FetcherTest.MalformedURI
[ OK ] FetcherTest.MalformedURI (48 ms)
[----------] 1 test from FetcherTest (63 ms total)
[----------] 1 test from GarbageCollectorTest
[ RUN ] GarbageCollectorTest.Schedule
C:\Users\andschwa\src\mesos-master\src\tests\utils.cpp(64): error: Failed to
wait 15secs for response
C:\Users\andschwa\src\mesos-master\src\tests\utils.cpp(65): error: Failed to
wait 15secs for response
{noformat}
{{GarbageCollectorTest.Schedule}} is the first test to hang in an unfiltered
run of {{mesos-tests}}.
The hang can be reproduced minimally by running any two tests that call
{{Metrics()}} from {{utils.cpp}}. The following filters have been confirmed to
reproduce it:
{noformat}
--gtest_filter="GarbageCollectorTest.Schedule:HierarchicalAllocatorTest.OfferFilter"
--gtest_filter="GarbageCollectorTest.Schedule:FetcherTest.MalformedURI"
--gtest_filter="HierarchicalAllocatorTest.OfferFilter:FetcherTest.MalformedURI"
{noformat}
The second test hangs waiting for a {{GET}} to {{/metrics/snapshot}} that
never returns, which suggests a race condition.
The bug also appears to be timing-sensitive. If the CPU is heavily utilized
(say, by running another build in the background), the tests pass. They also
pass if Application Verifier is attached to {{mesos-tests.exe}}, which slows
execution down enough. Very slow machines (such as those used for CI) do not
exhibit the hang either.
Oddly, the bug does reproduce under the Visual Studio debugger, but all the
debugger shows us is a pending future waiting for the metrics request to come
back.
In {{metrics.cpp}} there is a note that the request might time out, but we are
unsure whether this is the same problem or a different one manifesting in the
same way:
{noformat}
// TODO(neilc): This request might timeout if the current value of a
// metric cannot be determined. In tests, a common cause for this is
// MESOS-6231 when multiple scheduler drivers are in use.
{noformat}
A {{git bisect}} identified the first bad commit:
{noformat}
20c5311434e45a631ffc6036d327e00b2228ad26 is the first bad commit
commit 20c5311434e45a631ffc6036d327e00b2228ad26
Author: James Peach <[email protected]>
Date: Tue Aug 22 16:19:47 2017 -0700
Added agent garbage collection metrics.
Added some basic sandbox garbage collection metrics to track the number
of successful, failed and pending path removals.
Review: https://reviews.apache.org/r/61260/
{noformat}
This commit caused the bug to appear, but that does not necessarily mean it
introduced the bug. Reverting it allows all the tests to pass, but we believe
this merely hides the bug.
This bug reproduces on Windows machines both with and without Docker (and
Windows containers) installed. (I mention this only because it was a variable
on my machine when the bug first appeared, but I have since ruled it out as
relevant.)
We do not think the bug is specific to {{libevent}}: it does not reproduce on
a Linux VM built with {{libevent}} instead of {{libev}}.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)