> On Feb. 28, 2018, 10:04 a.m., Greg Mann wrote:
> > src/docker/executor.cpp
> > Lines 242-247 (patched)
> > <https://reviews.apache.org/r/65759/diff/3/?file=1965994#file1965994line242>
> >
> > Doesn't this render the `onFailed` callback registered on L357 useless?
> > i.e., the `inspect` future will never transition to the failed state?
> >
> > If so, we should probably either remove the `onFailed` callback, or do
> > this instead:
> > ```
> > if (!future.hasDiscard()) {
> > return Break(future);
> > }
> > ```
> >
> > I guess the question is: if the inspect call actually fails, rather
> > than hanging, do we want to retry? It looks like there are several cases in
> > `Docker::inspect` which will result in a failure (failed to create
> > subprocess, failed to read stdout, etc..), and it looks to me like we could
> > probably just retry in those cases. WDYT?
Good point! If a docker daemon returns non-zero, the docker libray will retry
`inspect`, then we'll get a message kile:
`I0228 17:28:13.275115 3248 docker.cpp:1369] Retrying inspect with non-zero
status code. cmd: 'docker -H unix:///var/run/docker.sock inspect
mesos-210b988c-c808-47e5-af65-75f40269755b', interval: 500ms`
But if the docker library returns a failure itself due to some severe bug
(failed to create subprocess, failed to read stdout, etc...), then IMO we
should stop retrying `inspect`:
```
[](const Future<Docker::Container>& future)
-> Future<ControlFlow<Docker::Container>> {
if (future.isReady()) {
return Break(future.get());
}
if (future.isFailed()) {
return Failure(future.failure());
}
return Continue();
});
```
- Andrei
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65759/#review198382
-----------------------------------------------------------
On Feb. 22, 2018, 8:32 p.m., Andrei Budnik wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65759/
> -----------------------------------------------------------
>
> (Updated Feb. 22, 2018, 8:32 p.m.)
>
>
> Review request for mesos, Alexander Rukletsov, Gilbert Song, Greg Mann, and
> Michael Park.
>
>
> Bugs: MESOS-8574
> https://issues.apache.org/jira/browse/MESOS-8574
>
>
> Repository: mesos
>
>
> Description
> -------
>
> This patch adds retries for `inspect` command to workaround docker
> daemon hangs. We assume that the docker daemon can be temporarily
> unresponsive. If it's unresponsive, then any started docker cli
> command hangs. To address the issue, we retry `inspect` in the loop.
>
>
> Diffs
> -----
>
> src/docker/executor.cpp 93c3e1d1e86814e34cbe5b045f6e61911266c535
>
>
> Diff: https://reviews.apache.org/r/65759/diff/4/
>
>
> Testing
> -------
>
> internal CI
>
> Manually, described in /r/65713
>
>
> Thanks,
>
> Andrei Budnik
>
>