> On April 15, 2014, 9:12 p.m., Ben Mahler wrote:
> > src/log/recover.cpp, line 113
> > <https://reviews.apache.org/r/18600/diff/9/?file=549174#file549174line113>
> >
> >     If we're auto-initializing, shouldn't we 'watch' for the cluster size
> >     as opposed to the quorum size to ensure we don't get stuck?
>
> Benjamin Hindman wrote:
>     Beyond just watching for the cluster size to appear, what happens when
>     this watch is triggered but before any messages are sent out the replicas die
>     (or are stopped by an operator)? We don't want the replicas to get into a
>     completely blocked state so we need some way of either retrying to do the
>     auto-initialization after some timeouts and all together bailing after some
>     number of retries.
I thought about this again this morning. First, we CANNOT watch for the cluster size, because we don't know whether we are in the auto-init case until we receive responses from the replicas. If we are in the catch-up case, we don't need all replicas to respond.

So I plan to do what BenH suggested: add a retry timeout for log recovery (retry the recovery after the timeout occurs).


- Jie


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/18600/#review40464
-----------------------------------------------------------


On April 4, 2014, 7:12 p.m., Jie Yu wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/18600/
> -----------------------------------------------------------
>
> (Updated April 4, 2014, 7:12 p.m.)
>
>
> Review request for mesos, Benjamin Hindman and Ben Mahler.
>
>
> Bugs: MESOS-984
>     https://issues.apache.org/jira/browse/MESOS-984
>
>
> Repository: mesos-git
>
>
> Description
> -------
>
> See summary.
>
>
> Diffs
> -----
>
>   src/log/log.hpp 6787c80
>   src/log/log.cpp 9dd992f
>   src/log/recover.hpp 634bc06
>   src/log/recover.cpp 688da5f
>   src/tests/log_tests.cpp 4f08927
>
> Diff: https://reviews.apache.org/r/18600/diff/
>
>
> Testing
> -------
>
> make check
>
>
> Thanks,
>
> Jie Yu
>
>
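For reference, the retry idea discussed in this thread (retry the recovery after a timeout, and bail after some number of retries) could be sketched roughly as below. This is only an illustration, not the code under review: `Attempt`, `recoverWithRetry`, and `maxAttempts` are hypothetical names, and the sketch is synchronous for brevity, whereas the real recovery code would presumably be asynchronous.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical sketch: none of these names come from src/log/recover.cpp.
// An `Attempt` runs one recovery attempt bounded by the given timeout and
// returns true if recovery finished, or false if the attempt timed out
// (e.g. not enough replicas responded in time).
using Attempt = std::function<bool(std::chrono::milliseconds)>;

// Retries `attempt` until it succeeds or `maxAttempts` attempts have been
// made. Returns false when every attempt timed out, so the caller can bail
// instead of staying blocked forever.
bool recoverWithRetry(
    const Attempt& attempt,
    std::chrono::milliseconds timeout,
    int maxAttempts)
{
  assert(maxAttempts > 0);

  for (int i = 0; i < maxAttempts; ++i) {
    if (attempt(timeout)) {
      return true;
    }
    // Timed out: fall through and start a fresh attempt.
  }

  return false;
}
```

A caller would pass in whatever performs a single recovery round, for example `recoverWithRetry(attemptOnce, std::chrono::milliseconds(10000), 3)`, where `attemptOnce` and both values are placeholders. Because each attempt starts fresh, the auto-init versus catch-up decision can still be made from whichever responses arrive in that attempt.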
