Re: [grpc-io] gRPC A6: Retries

michal Tue, 28 Feb 2017 07:37:41 -0800

Right, let me chip in and continue the discussion 
from: https://github.com/grpc/proposal/pull/12#issuecomment-283063869 here. 
My comments are based on experience building a gRPC-Go interceptor for 
retries and using it in production at Improbable. It's important to note 
that we're pretty heavy users of gRPC (using it across 3 languages: Go, 
Java and C++), as we have quite a few people around (myself included) who 
are familiar with gRPC/Stubby from their prior jobs.


Now, to recap the points:


Thank you for the comments. We are trying to keep high-level discussion on 
the email thread (see here) but my responses to your points (b) and (c) are 
below.

> > b) retry logic would benefit a lot from knowing whether the method is 
idempotent or not. II understand that this is supposed to be handled by 
"service configs", but realistically they're really hard to use. Few people 
would put their retry logic in DNS TXT entries, and even fewer people 
operate the gRPC LB protocol. Can we consider adding a .proto option 
(annotation) to Method definitions?

> The re are two concerns here. One is that saying a method is idempotent 
really just means "retry on status codes x y and z". If we pick a canonical 
set of idempotent status codes, we are forcing every service owner to obey 
these semantics if they want to use retries. However, the gRPC status code 
mechanism is very flexible (even allowing services to return UNAVAILABLE 
arbitrarily) so we'd prefer to force service owners to consider the 
semantics of their application and pick a concrete set of status codes 
rather than just flipping an "idempotent" switch. The second concern is 
around the ease of use of the service config. The intent is for the service 
config to be a universally useful mechanism, and we want to avoid just 
baking everything into the proto. Concerns about the delivery mechanism for 
service config shouldn't invalidate its use for encoding retry policy, and 
may be something we have to tackle separately.

I appreciate the push for the *service config, *and given my past SRE 
experience, I totally appreciate it. However, in the open source world 
simple solutions seem to get most traction. I would be hesitant to tie a 
very very important feature (such as retries) to the *service config*
 adoption. 

I understand your concerns around the flexibility of service codes, and I 
wasn't advocating being prescriptive about it. However, I do think that 
having an option inside the .proto is a very valid approach. As a user of 
gRPC for internal and external purposes there are three cases how I use 
.protos as interfaces:
 * have internal teams use each-other's services, in which case they code 
against code-generated interfaces that come out of .proto files
 * to our end users provide a set of "published" .proto files and guides of 
how to generate and use them in the language of their choice through gRPC 
code generation (the true power of gRPC).
 * to our end users provide rich client APIs for any language, in which 
case all bets are off and any thing can be implemented

As such, for both external and internal services, the .proto *is* the 
canonical contract, controlled by the team building the service. Thus 
having something like the following is satisfying the most common use case:

rpc RemoveTag(TagRemoveRequest) returns (google.protobuf.Empty) {
    option (grpc.extensions) = {
      retriable_codes: ["UNAVAILABLE", "RESOURCE_EXHAUSTED"]
    };
  };


> > c) One thing I found out useful "in the wild" is the ability to limit 
the Deadline of the retriable call. For example, the "parent" RPC call 
(user invoked) has a deadline of 5s, but each retriable call only 1s. This 
allows you to skip a "deadlining" server and retry against one that works.
> This is covered by our hedging policy. There doesn't seem to be any 
reason to cancel the first RPC in your scenario, as it may be just about to 
complete on the server and cancellation implies the work done so far is 
wasted. Instead, hedging allows you to send the first request, wait one 
second, send a second request, and accept the results of whichever 
completes first.

Ok, this makes sense. However, can we make sure that if the second hedged 
request completes before the first one, we make sure that the spec expects 
to CANCEL the first call, so we can potentially free up resources? One 
unfortunate thing of working outside an environment where everything is 
Stubby, is that a lot of the time request handling holds up resources. For 
example, you establish another HTTP1.1 connection to a backend as part of 
serving your RPC. I'll augment my gRPC interceptors for Go to use this. 

On Monday, 27 February 2017 16:53:20 UTC, Mark D. Roth wrote:
>
> While talking with Craig on Friday, we realized that we need to make the 
> wire protocol a bit stricter in order to implement retries.
>
> Currently, the spec allows status to be sent either as part of initial 
> metadata or trailing metadata.  However, as per the When Retries are Valid 
> <https://github.com/ncteisen/proposal/blob/75e08fa10405430e8177cfd91bf63a25ff4ad617/A6.md#when-retries-are-valid>
>  section 
> of the gRFC, an RFC becomes committed when "the client receives a non-error 
> response (either an explicit OK status or any response message) from the 
> server".  This means that in a case where the server sends a retryable 
> status, if the status is not included in the initial metadata, the client 
> will consider the RPC committed as soon as it receives the initial 
> metadata, even if the only thing sent after that is the trailing metadata 
> that includes the status.  Thus, we need to require that whenever the 
> server sends status without sending any messages, the server should include 
> the status in the initial metadata (and then close the stream without 
> bothering to send trailing metadata) instead of sending both initial 
> metadata and then trailing metadata.
>
> Noah, can you please add a note about this to the gRFC?
>
> Based on a previously encounted interop problem (see 
> https://github.com/markdroth/grpc/pull/3, which was included in 
> https://github.com/grpc/grpc/pull/7201), I believe that grpc-go already 
> does the right thing here (although Saila and Menghan should confirm 
> that).  However, since that previously encountered problem did not show up 
> with Java or C++, I suspect that those stacks do not do the right thing 
> here.
>
> Craig has confirmed that C-core needs to be fixed in this regard, and I've 
> filed https://github.com/grpc/grpc/issues/9883 for that change.
>
> Eric and Penn, can you confirm that Java will need to be changed?  I'm 
> hoping that this isn't too invasive of a change, but please let us know if 
> you foresee any problems.
>
> Please let me know if anyone has any questions or problems with any of 
> this.  Thanks!
>
> On Fri, Feb 10, 2017 at 4:31 PM, ncteisen via grpc.io <
> grp...@googlegroups.com <javascript:>> wrote:
>
>> I've created a gRFC describing the design and implementation plan for 
>> gRPC Retries.
>>
>> Take a look at the gRPC on Github 
>> <https://github.com/grpc/proposal/pull/12>.
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "grpc.io" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to grpc-io+u...@googlegroups.com <javascript:>.
>> To post to this group, send email to grp...@googlegroups.com 
>> <javascript:>.
>> Visit this group at https://groups.google.com/group/grpc-io.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/grpc-io/30e29cbc-439c-46c4-b54f-6e97637a0735%40googlegroups.com
>>  
>> <https://groups.google.com/d/msgid/grpc-io/30e29cbc-439c-46c4-b54f-6e97637a0735%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
>
> -- 
> Mark D. Roth <ro...@google.com <javascript:>>
> Software Engineer
> Google, Inc.
>

-- 
You received this message because you are subscribed to the Google Groups 
"grpc.io" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to grpc-io+unsubscr...@googlegroups.com.
To post to this group, send email to grpc-io@googlegroups.com.
Visit this group at https://groups.google.com/group/grpc-io.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/grpc-io/40e2a470-0e4c-4687-b250-a00afc76f38f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: [grpc-io] gRPC A6: Retries

Reply via email to