On Tue, Feb 28, 2017 at 7:37 AM, <[email protected]> wrote: > Right, let me chip in and continue the discussion from: > https://github.com/grpc/proposal/pull/12#issuecomment-283063869 here. My > comments are based on experience building a gRPC-Go interceptor for retries > and using it in production at Improbable. It's important to note that we're > pretty heavy users of gRPC (using it across 3 languages: Go, Java and C++), > as we have quite a few people around (myself included) who are familiar > with gRPC/Stubby from their prior jobs. > > Now, to recap the points: > > > Thank you for the comments. We are trying to keep high-level discussion on > the email thread (see here) but my responses to your points (b) and (c) are > below. > > > > b) retry logic would benefit a lot from knowing whether the method is > idempotent or not. II understand that this is supposed to be handled by > "service configs", but realistically they're really hard to use. Few people > would put their retry logic in DNS TXT entries, and even fewer people > operate the gRPC LB protocol. Can we consider adding a .proto option > (annotation) to Method definitions? > > > The re are two concerns here. One is that saying a method is idempotent > really just means "retry on status codes x y and z". If we pick a canonical > set of idempotent status codes, we are forcing every service owner to obey > these semantics if they want to use retries. However, the gRPC status code > mechanism is very flexible (even allowing services to return UNAVAILABLE > arbitrarily) so we'd prefer to force service owners to consider the > semantics of their application and pick a concrete set of status codes > rather than just flipping an "idempotent" switch. The second concern is > around the ease of use of the service config. The intent is for the service > config to be a universally useful mechanism, and we want to avoid just > baking everything into the proto. Concerns about the delivery mechanism for > service config shouldn't invalidate its use for encoding retry policy, and > may be something we have to tackle separately. > > I appreciate the push for the *service config, *and given my past SRE > experience, I totally appreciate it. However, in the open source world > simple solutions seem to get most traction. I would be hesitant to tie a > very very important feature (such as retries) to the *service config* > adoption. > > I understand your concerns around the flexibility of service codes, and I > wasn't advocating being prescriptive about it. However, I do think that > having an option inside the .proto is a very valid approach. As a user of > gRPC for internal and external purposes there are three cases how I use > .protos as interfaces: > * have internal teams use each-other's services, in which case they code > against code-generated interfaces that come out of .proto files > * to our end users provide a set of "published" .proto files and guides > of how to generate and use them in the language of their choice through > gRPC code generation (the true power of gRPC). > * to our end users provide rich client APIs for any language, in which > case all bets are off and any thing can be implemented > > As such, for both external and internal services, the .proto *is* the > canonical contract, controlled by the team building the service. Thus > having something like the following is satisfying the most common use case: > > rpc RemoveTag(TagRemoveRequest) returns (google.protobuf.Empty) { > option (grpc.extensions) = { > retriable_codes: ["UNAVAILABLE", "RESOURCE_EXHAUSTED"] > }; > }; > > > > > c) One thing I found out useful "in the wild" is the ability to limit > the Deadline of the retriable call. For example, the "parent" RPC call > (user invoked) has a deadline of 5s, but each retriable call only 1s. This > allows you to skip a "deadlining" server and retry against one that works. > > This is covered by our hedging policy. There doesn't seem to be any > reason to cancel the first RPC in your scenario, as it may be just about to > complete on the server and cancellation implies the work done so far is > wasted. Instead, hedging allows you to send the first request, wait one > second, send a second request, and accept the results of whichever > completes first. > > Ok, this makes sense. However, can we make sure that if the second hedged > request completes before the first one, we make sure that the spec expects > to CANCEL the first call, so we can potentially free up resources? One > unfortunate thing of working outside an environment where everything is > Stubby, is that a lot of the time request handling holds up resources. For > example, you establish another HTTP1.1 connection to a backend as part of > serving your RPC. I'll augment my gRPC interceptors for Go to use this. > > Definitely. As soon as one request succeeds, or fails with a fatal status code (e.g., FAILED_PRECONDITION), the spec requires that remaining requests be cancelled and the result sent to the client application layer. The Summary of Retry and Hedging Logic section <https://github.com/ncteisen/proposal/blob/9136afba3b60f414d2859524794ba4009772b38d/A6.md#summary-of-retry-and-hedging-logic> outlines this behavior.
Thanks, Eric > On Monday, 27 February 2017 16:53:20 UTC, Mark D. Roth wrote: >> >> While talking with Craig on Friday, we realized that we need to make the >> wire protocol a bit stricter in order to implement retries. >> >> Currently, the spec allows status to be sent either as part of initial >> metadata or trailing metadata. However, as per the When Retries are >> Valid >> <https://github.com/ncteisen/proposal/blob/75e08fa10405430e8177cfd91bf63a25ff4ad617/A6.md#when-retries-are-valid> >> section >> of the gRFC, an RFC becomes committed when "the client receives a non-error >> response (either an explicit OK status or any response message) from the >> server". This means that in a case where the server sends a retryable >> status, if the status is not included in the initial metadata, the client >> will consider the RPC committed as soon as it receives the initial >> metadata, even if the only thing sent after that is the trailing metadata >> that includes the status. Thus, we need to require that whenever the >> server sends status without sending any messages, the server should include >> the status in the initial metadata (and then close the stream without >> bothering to send trailing metadata) instead of sending both initial >> metadata and then trailing metadata. >> >> Noah, can you please add a note about this to the gRFC? >> >> Based on a previously encounted interop problem (see >> https://github.com/markdroth/grpc/pull/3, which was included in >> https://github.com/grpc/grpc/pull/7201), I believe that grpc-go already >> does the right thing here (although Saila and Menghan should confirm >> that). However, since that previously encountered problem did not show up >> with Java or C++, I suspect that those stacks do not do the right thing >> here. >> >> Craig has confirmed that C-core needs to be fixed in this regard, and >> I've filed https://github.com/grpc/grpc/issues/9883 for that change. >> >> Eric and Penn, can you confirm that Java will need to be changed? I'm >> hoping that this isn't too invasive of a change, but please let us know if >> you foresee any problems. >> >> Please let me know if anyone has any questions or problems with any of >> this. Thanks! >> >> On Fri, Feb 10, 2017 at 4:31 PM, ncteisen via grpc.io < >> [email protected]> wrote: >> >>> I've created a gRFC describing the design and implementation plan for >>> gRPC Retries. >>> >>> Take a look at the gRPC on Github >>> <https://github.com/grpc/proposal/pull/12>. >>> >>> -- >>> You received this message because you are subscribed to the Google >>> Groups "grpc.io" group. >>> To unsubscribe from this group and stop receiving emails from it, send >>> an email to [email protected]. >>> To post to this group, send email to [email protected]. >>> Visit this group at https://groups.google.com/group/grpc-io. >>> To view this discussion on the web visit https://groups.google.com/d/ms >>> gid/grpc-io/30e29cbc-439c-46c4-b54f-6e97637a0735%40googlegroups.com >>> <https://groups.google.com/d/msgid/grpc-io/30e29cbc-439c-46c4-b54f-6e97637a0735%40googlegroups.com?utm_medium=email&utm_source=footer> >>> . >>> For more options, visit https://groups.google.com/d/optout. >>> >> >> >> >> -- >> Mark D. Roth <[email protected]> >> Software Engineer >> Google, Inc. >> > -- > You received this message because you are subscribed to the Google Groups " > grpc.io" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to [email protected]. > To post to this group, send email to [email protected]. > Visit this group at https://groups.google.com/group/grpc-io. > To view this discussion on the web visit https://groups.google.com/d/ > msgid/grpc-io/40e2a470-0e4c-4687-b250-a00afc76f38f%40googlegroups.com > <https://groups.google.com/d/msgid/grpc-io/40e2a470-0e4c-4687-b250-a00afc76f38f%40googlegroups.com?utm_medium=email&utm_source=footer> > . > > For more options, visit https://groups.google.com/d/optout. > -- You received this message because you are subscribed to the Google Groups "grpc.io" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To post to this group, send email to [email protected]. Visit this group at https://groups.google.com/group/grpc-io. To view this discussion on the web visit https://groups.google.com/d/msgid/grpc-io/CALUXJ7iGroobGc0fbsxYDSr_DSwpNyNsyQLqaFywwHgfFDQgyg%40mail.gmail.com. For more options, visit https://groups.google.com/d/optout.
