> On Jan. 23, 2019, 3:33 p.m., James DeFelice wrote:
> > src/resource_provider/storage/provider.cpp
> > Lines 1904 (patched)
> > <https://reviews.apache.org/r/69812/diff/1/?file=2121403#file2121403line1904>
> >
> > There are a few other calls in the spec that might return
> > `RESOURCE_EXHAUSTED`, which is also mitigated by backoff. Consider adding
> > that case as well.
> >
> > Furthermore, some calls may return `NOT_FOUND`, which may also be
> > mitigated by a retry. It's not clear that the SLRP has enough information
> > for a retry in every such case. Needs more thought.

Right, there are a couple of other error statuses we could consider retrying. However, taking `RESOURCE_EXHAUSTED` as an example, it is a retryable error for `CreateVolume` and `ControllerPublishVolume` given some preconditions, but not a retryable error for `CreateSnapshot` in the latest CSI spec. In the future we could build up a per-call retry policy that contains a list of retryable errors with their associated preconditions. But for now I'm being conservative and sticking with what https://grpc.io/grpc/cpp/namespacegrpc.html#aff1730578c90160528f6a8d67ef5c43b states as a guideline for general retries.

Dropping. Please reopen it if you feel we should address this right now.


> On Jan. 23, 2019, 3:33 p.m., James DeFelice wrote:
> > src/resource_provider/storage/provider.cpp
> > Lines 1916 (patched)
> > <https://reviews.apache.org/r/69812/diff/1/?file=2121403#file2121403line1916>
> >
> > what about a metric for call retries?

`resource_providers/<type>.<name>/csi_plugin/rpcs/<rpc>/errors` should be a good approximation. I'll create a follow-up patch for finer-grained error metrics, but probably won't backport it. Is it good enough?


- Chun-Hung


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69812/#review212242
-----------------------------------------------------------


On Jan. 23, 2019, 7:10 a.m., Chun-Hung Hsiao wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69812/
> -----------------------------------------------------------
>
> (Updated Jan. 23, 2019, 7:10 a.m.)
>
>
> Review request for mesos, Benjamin Bannier, James DeFelice, Jie Yu, and Jan Schlicht.
>
>
> Bugs: MESOS-9517
>     https://issues.apache.org/jira/browse/MESOS-9517
>
>
> Repository: mesos
>
>
> Description
> -------
>
> When the CSI plugin returns a retryable error (i.e., `DEADLINE_EXCEEDED`
> or `UNAVAILABLE`) for `CreateVolume` or `DeleteVolume` CSI calls, SLRP
> will now retry indefinitely with a random exponential backoff.
>
>
> Diffs
> -----
>
>   src/csi/client.hpp 5d40d54c2abbd03993ce8835d37db23e209c7554
>   src/csi/client.cpp 61ed410985099828a2f58b1527ab57daa4b379df
>   src/resource_provider/storage/provider.hpp 331f7b785b14b814c2889488effd53f3a48a1b95
>   src/resource_provider/storage/provider.cpp d6e20a549ede189c757ae3ae922ab7cb86d2be2c
>
>
> Diff: https://reviews.apache.org/r/69812/diff/1/
>
>
> Testing
> -------
>
> make check
>
> A unit test will be added later in the chain.
>
>
> Thanks,
>
> Chun-Hung Hsiao
>
>
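
For readers skimming the thread, here is a rough standalone sketch of the behaviour the quoted description refers to: retry only on `DEADLINE_EXCEEDED` or `UNAVAILABLE`, indefinitely, with a random exponential backoff. This is not the code from the patch. The `Status` enum, `Backoff`, `isRetryable` and `callWithRetry` are hypothetical names made up for illustration, and the specific jitter scheme (uniform delay up to a limit that doubles until it hits a cap) is an assumption about one reasonable way to do it.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <random>

using Millis = std::chrono::milliseconds;

// Hypothetical stand-in for the gRPC status codes mentioned in this thread.
enum class Status {
  OK,
  DEADLINE_EXCEEDED,
  UNAVAILABLE,
  RESOURCE_EXHAUSTED,
  NOT_FOUND
};

// Conservative policy from the reply: only these two codes are retried.
inline bool isRetryable(Status status)
{
  return status == Status::DEADLINE_EXCEEDED || status == Status::UNAVAILABLE;
}

// Hypothetical backoff state: each call to `next()` draws a uniform random
// delay in [0, limit], then doubles the limit up to `cap`.
class Backoff
{
public:
  Backoff(Millis initial, Millis cap, unsigned seed)
    : limit_(initial), cap_(cap), rng_(seed)
  {
    assert(initial.count() > 0 && initial <= cap);
  }

  Millis next()
  {
    std::uniform_int_distribution<Millis::rep> dist(0, limit_.count());
    Millis delay(dist(rng_));
    limit_ = std::min(limit_ * 2, cap_);
    return delay;
  }

  Millis limit() const { return limit_; }

private:
  Millis limit_;
  Millis cap_;
  std::mt19937 rng_;
};

// Invokes `rpc` until it returns a status that is not retryable. There is
// deliberately no attempt limit. `sleep` is injected so the loop can be
// exercised without actually waiting.
inline Status callWithRetry(
    const std::function<Status()>& rpc,
    Backoff& backoff,
    const std::function<void(Millis)>& sleep)
{
  for (;;) {
    Status status = rpc();
    if (!isRetryable(status)) {
      return status;
    }
    sleep(backoff.next());
  }
}
```

With this shape, a status such as `RESOURCE_EXHAUSTED` or `NOT_FOUND` is returned to the caller on the first attempt, which matches the conservative choice in the reply. A per-call policy could later replace `isRetryable` with a lookup keyed on the RPC name. In a real caller `sleep` might be `std::this_thread::sleep_for`, or an asynchronous delay in an actor-based codebase.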
