Andrew Schwartzmeyer created MESOS-7929:
-------------------------------------------

             Summary: `Metrics()` hangs on second call on Windows
                 Key: MESOS-7929
                 URL: https://issues.apache.org/jira/browse/MESOS-7929
             Project: Mesos
          Issue Type: Bug
         Environment: Windows 10 Enterprise Build 15063 (also confirmed on 14393).
            Reporter: Andrew Schwartzmeyer
            Priority: Critical


An unfortunately difficult-to-debug problem has cropped up on Windows. When 
running {{mesos-tests}}, they hang at:

{noformat}
[==========] Running 2 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 1 test from FetcherTest
[ RUN      ] FetcherTest.MalformedURI
[       OK ] FetcherTest.MalformedURI (48 ms)
[----------] 1 test from FetcherTest (63 ms total)

[----------] 1 test from GarbageCollectorTest
[ RUN      ] GarbageCollectorTest.Schedule
C:\Users\andschwa\src\mesos-master\src\tests\utils.cpp(64): error: Failed to 
wait 15secs for response
C:\Users\andschwa\src\mesos-master\src\tests\utils.cpp(65): error: Failed to 
wait 15secs for response
{noformat}

{{GarbageCollectorTest.Schedule}} is the first test to hang in an unfiltered 
run of {{mesos-tests}}.

This can be reproduced minimally by running any two tests that call 
{{Metrics()}} from {{utils.cpp}}. The following filters have been confirmed:

{noformat}
--gtest_filter="GarbageCollectorTest.Schedule:HierarchicalAllocatorTest.OfferFilter"
--gtest_filter="GarbageCollectorTest.Schedule:FetcherTest.MalformedURI"
--gtest_filter="HierarchicalAllocatorTest.OfferFilter:FetcherTest.MalformedURI"
{noformat}

The second test will hang (indicating a race condition), waiting for a {{GET}} 
to {{/metrics/snapshot}} that never returns.

There appears to be a timing component to this bug as well. If the CPU is 
heavily utilized (say, by running another build in the background), the tests 
pass. They also pass when Application Verifier is attached to 
{{mesos-tests.exe}}, which slows execution down enough. Very slow machines 
(such as those used for CI) do not exhibit the hang either.

Oddly, the bug does reproduce under the Visual Studio debugger, but all the 
debugger shows is a pending future waiting for the metrics request to come back.

In {{metrics.cpp}} there is a note that the request might time out, but we are 
unsure whether this is the same problem or a different problem manifesting in 
the same way:

{noformat}
  // TODO(neilc): This request might timeout if the current value of a
  // metric cannot be determined. In tests, a common cause for this is
  // MESOS-6231 when multiple scheduler drivers are in use.
{noformat}

A {{git bisect}} identified the following as the first bad commit:

{noformat}
20c5311434e45a631ffc6036d327e00b2228ad26 is the first bad commit
commit 20c5311434e45a631ffc6036d327e00b2228ad26
Author: James Peach <[email protected]>
Date:   Tue Aug 22 16:19:47 2017 -0700

    Added agent garbage collection metrics.

    Added some basic sandbox garbage collection metrics to track the number
    of successful, failed and pending path removals.

    Review: https://reviews.apache.org/r/61260/
{noformat}

This commit caused the bug to appear, but that does not necessarily mean it 
introduced the bug. Reverting it allows all the tests to pass, but we believe 
this just hides the underlying problem.
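For the record, a bisect like this can be automated with {{git bisect run}}. The sketch below is illustrative only: the good revision, the build command, and the 120-second bound are placeholders and assumptions, not what was actually run.

```shell
# Sketch only: <last-known-good-sha>, the build command, and the
# 120-second bound are placeholders/assumptions.
git bisect start
git bisect bad HEAD
git bisect good <last-known-good-sha>

# "git bisect run" marks a commit good on exit 0, bad on any other
# non-125 exit code. GNU timeout exits 124 on a hang, so a hanging
# test pair is counted as bad automatically.
git bisect run sh -c '
  cmake --build . --target mesos-tests &&
  timeout 120 ./src/mesos-tests \
    --gtest_filter="GarbageCollectorTest.Schedule:FetcherTest.MalformedURI"
'

git bisect reset
```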

This bug has reproduced on Windows machines with and without Docker (and 
Windows containers) installed. (I only mention this because it was a variable 
on my machine when the bug first appeared; I have since ruled it out as 
relevant.)

We do not think the bug is specific to {{libevent}}, as it does not appear to 
reproduce on a Linux VM built with {{libevent}} instead of the default {{libev}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
