[ 
https://issues.apache.org/jira/browse/MESOS-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932704#comment-13932704
 ] 

Yan Xu edited comment on MESOS-1088 at 3/13/14 12:43 AM:
---------------------------------------------------------

I suspect that the problem lies in the 
[latch|https://github.com/apache/mesos/blob/ea1ce107bb2aadc947563f1b59c7d08d1b7125f3/3rdparty/libprocess/src/latch.cpp].

{code:title=latch.cpp}
void Latch::trigger()
{
  if (!triggered) {
    terminate(pid);
{code}

It's possible for {{process::wait(pid, duration)}} below to return, and in turn 
for {{Latch::await(...)}} to return {{false}}, before the next line executes, 
right?

{code:title=latch.cpp (continued)}
    triggered = true;
  }
}


bool Latch::await(const Duration& duration)
{
  if (!triggered) {
    process::wait(pid, duration); // Explicit to disambiguate.
    // It's possible that we failed to wait because:
    //   (1) Our process has already terminated.
    //   (2) We timed out (i.e., duration was not "infinite").

    // In the event of (1) we might need to return 'true' since a
    // terminated process might imply that the latch has been
    // triggered. To capture this we simply return the value of
    // 'triggered' (which will also capture cases where we actually
    // timed out but have since triggered, which seems like an
    // acceptable semantics given such a "tie").
    return triggered;
  }

  return true;
}
{code}
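
If that is indeed the race, one way to close the window is to record the trigger before waking any waiter. The following is only a minimal sketch using standard C++ primitives rather than libprocess: {{SketchLatch}} and {{awaitObservesTrigger}} are hypothetical names, the condition variable merely stands in for {{terminate(pid)}} and {{process::wait(pid, duration)}}, and it is not necessarily how the real latch should be (or was) fixed.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the libprocess Latch; not the real class.
class SketchLatch
{
public:
  // Record the trigger first, then wake waiters, so that any waiter
  // woken by trigger() is guaranteed to observe 'triggered == true'.
  void trigger()
  {
    {
      std::lock_guard<std::mutex> lock(mutex);
      if (triggered) {
        return; // Already triggered; nothing to do.
      }
      triggered = true; // Step 1: publish the state.
    }
    cv.notify_all(); // Step 2: wake waiters (the role terminate(pid) plays).
  }

  // Returns true if the latch was triggered within 'duration',
  // false if the wait timed out first.
  bool await(std::chrono::milliseconds duration)
  {
    std::unique_lock<std::mutex> lock(mutex);
    // The predicate is re-checked under the lock on every wakeup, so a
    // wakeup can never be observed without the flag that caused it.
    return cv.wait_for(lock, duration, [this]() { return triggered; });
  }

private:
  std::mutex mutex;
  std::condition_variable cv;
  bool triggered = false;
};

// Runs one trigger/await pair on two threads and reports what await() saw.
bool awaitObservesTrigger()
{
  SketchLatch latch;
  std::thread trigger([&latch]() { latch.trigger(); });
  bool result = latch.await(std::chrono::seconds(10));
  trigger.join();
  return result;
}
```

With this ordering, {{awaitObservesTrigger()}} should return {{true}} regardless of how the two threads interleave, whereas an {{await}} on a latch that is never triggered still times out and returns {{false}}.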



> ZooKeeperMasterContenderDetectorTest.MasterDetectorExpireSlaveZKSessionNewMaster
>  is flaky
> -----------------------------------------------------------------------------------------
>
>                 Key: MESOS-1088
>                 URL: https://issues.apache.org/jira/browse/MESOS-1088
>             Project: Mesos
>          Issue Type: Bug
>          Components: test
>            Reporter: Yan Xu
>            Assignee: Yan Xu
>             Fix For: 0.19.0
>
>
> {code}
> [ RUN      ] 
> ZooKeeperMasterContenderDetectorTest.MasterDetectorExpireSlaveZKSessionNewMaster
> I0312 15:50:02.733414  2029 zookeeper_test_server.cpp:158] Started 
> ZooKeeperTestServer on port 32925
> 2014-03-12 15:50:02,733:2029(0x7fc285609700):ZOO_INFO@log_env@712: Client 
> environment:zookeeper.version=zookeeper C client 3.4.5
> 2014-03-12 15:50:02,733:2029(0x7fc285609700):ZOO_INFO@log_env@716: Client 
> environment:host.name=fedora-20
> 2014-03-12 15:50:02,734:2029(0x7fc285609700):ZOO_INFO@log_env@723: Client 
> environment:os.name=Linux
> 2014-03-12 15:50:02,734:2029(0x7fc285609700):ZOO_INFO@log_env@724: Client 
> environment:os.arch=3.13.6-200.fc20.x86_64
> 2014-03-12 15:50:02,734:2029(0x7fc285609700):ZOO_INFO@log_env@725: Client 
> environment:os.version=#1 SMP Fri Mar 7 17:02:28 UTC 2014
> 2014-03-12 15:50:02,734:2029(0x7fc285609700):ZOO_INFO@log_env@733: Client 
> environment:user.name=jenkins
> 2014-03-12 15:50:02,735:2029(0x7fc285609700):ZOO_INFO@log_env@741: Client 
> environment:user.home=/home/jenkins
> 2014-03-12 15:50:02,735:2029(0x7fc285609700):ZOO_INFO@log_env@753: Client 
> environment:user.dir=/var/jenkins/workspace/vinod-test/compiler/clang/os/fedora-20/src
> 2014-03-12 15:50:02,735:2029(0x7fc285609700):ZOO_INFO@zookeeper_init@786: 
> Initiating client connection, host=127.0.0.1:32925 sessionTimeout=10000 
> watcher=0x7fc28df599f0 sessionId=0 sessionPasswd=<null> 
> context=0x7fc264019490 flags=0
> I0312 15:50:02.738956  2050 contender.cpp:127] Joining the ZK group
> 2014-03-12 15:50:02,743:2029(0x7fc2532d1700):ZOO_INFO@check_events@1703: 
> initiated connection to server [127.0.0.1:32925]
> 2014-03-12 15:50:02,750:2029(0x7fc2532d1700):ZOO_INFO@check_events@1750: 
> session establishment complete on server [127.0.0.1:32925], 
> sessionId=0x144b87cfc6c0000, negotiated timeout=10000
> I0312 15:50:02.752624  2051 group.cpp:310] Group process 
> ((1177)@192.168.122.164:46605) connected to ZooKeeper
> I0312 15:50:02.752657  2051 group.cpp:778] Syncing group operations: queue 
> size (joins, cancels, datas) = (1, 0, 0)
> I0312 15:50:02.752666  2051 group.cpp:382] Trying to create path '/mesos' in 
> ZooKeeper
> I0312 15:50:02.770174  2052 contender.cpp:243] New candidate (id='0') has 
> entered the contest for leadership
> I0312 15:50:02.773874  2051 detector.cpp:134] Detected a new leader: (id='0')
> I0312 15:50:02.774001  2051 group.cpp:655] Trying to get 
> '/mesos/info_0000000000' in ZooKeeper
> I0312 15:50:02.778889  2051 detector.cpp:377] A new leading master 
> ([email protected]:10000) is detected
> tests/master_contender_detector_tests.cpp:738: Failure
> Failed to wait 10secs for detected
> I0312 15:50:02.779384  2029 contender.cpp:182] Now cancelling the membership: 0
> 2014-03-12 15:50:02,780:2029(0x7fc28f738880):ZOO_INFO@zookeeper_close@2505: 
> Closing zookeeper sessionId=0x144b87cfc6c0000 to [127.0.0.1:32925]
> I0312 15:50:02.784046  2029 zookeeper_test_server.cpp:122] Shutdown 
> ZooKeeperTestServer on port 32925
> [  FAILED  ] 
> ZooKeeperMasterContenderDetectorTest.MasterDetectorExpireSlaveZKSessionNewMaster
>  (55 ms)
> {code}
> Notice that only 55ms has elapsed for this test and the Clock is not paused.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
