Re: [grpc-io] gRPC A6: Retries

'Eric Gribkoff' via grpc.io Tue, 28 Feb 2017 12:24:55 -0800

On Tue, Feb 28, 2017 at 7:37 AM, <[email protected]> wrote:

> Right, let me chip in and continue the discussion from:
> https://github.com/grpc/proposal/pull/12#issuecomment-283063869 here. My
> comments are based on experience building a gRPC-Go interceptor for retries
> and using it in production at Improbable. It's important to note that we're
> pretty heavy users of gRPC (using it across 3 languages: Go, Java and C++),
> as we have quite a few people around (myself included) who are familiar
> with gRPC/Stubby from their prior jobs.
>
> Now, to recap the points:
>
>
> Thank you for the comments. We are trying to keep high-level discussion on
> the email thread (see here) but my responses to your points (b) and (c) are
> below.
>
> > > b) retry logic would benefit a lot from knowing whether the method is
> idempotent or not. II understand that this is supposed to be handled by
> "service configs", but realistically they're really hard to use. Few people
> would put their retry logic in DNS TXT entries, and even fewer people
> operate the gRPC LB protocol. Can we consider adding a .proto option
> (annotation) to Method definitions?
>
> > The re are two concerns here. One is that saying a method is idempotent
> really just means "retry on status codes x y and z". If we pick a canonical
> set of idempotent status codes, we are forcing every service owner to obey
> these semantics if they want to use retries. However, the gRPC status code
> mechanism is very flexible (even allowing services to return UNAVAILABLE
> arbitrarily) so we'd prefer to force service owners to consider the
> semantics of their application and pick a concrete set of status codes
> rather than just flipping an "idempotent" switch. The second concern is
> around the ease of use of the service config. The intent is for the service
> config to be a universally useful mechanism, and we want to avoid just
> baking everything into the proto. Concerns about the delivery mechanism for
> service config shouldn't invalidate its use for encoding retry policy, and
> may be something we have to tackle separately.
>
> I appreciate the push for the *service config, *and given my past SRE
> experience, I totally appreciate it. However, in the open source world
> simple solutions seem to get most traction. I would be hesitant to tie a
> very very important feature (such as retries) to the *service config*
>  adoption.
>
> I understand your concerns around the flexibility of service codes, and I
> wasn't advocating being prescriptive about it. However, I do think that
> having an option inside the .proto is a very valid approach. As a user of
> gRPC for internal and external purposes there are three cases how I use
> .protos as interfaces:
>  * have internal teams use each-other's services, in which case they code
> against code-generated interfaces that come out of .proto files
>  * to our end users provide a set of "published" .proto files and guides
> of how to generate and use them in the language of their choice through
> gRPC code generation (the true power of gRPC).
>  * to our end users provide rich client APIs for any language, in which
> case all bets are off and any thing can be implemented
>
> As such, for both external and internal services, the .proto *is* the
> canonical contract, controlled by the team building the service. Thus
> having something like the following is satisfying the most common use case:
>
> rpc RemoveTag(TagRemoveRequest) returns (google.protobuf.Empty) {
>     option (grpc.extensions) = {
>       retriable_codes: ["UNAVAILABLE", "RESOURCE_EXHAUSTED"]
>     };
>   };
>
>
> > > c) One thing I found out useful "in the wild" is the ability to limit
> the Deadline of the retriable call. For example, the "parent" RPC call
> (user invoked) has a deadline of 5s, but each retriable call only 1s. This
> allows you to skip a "deadlining" server and retry against one that works.
> > This is covered by our hedging policy. There doesn't seem to be any
> reason to cancel the first RPC in your scenario, as it may be just about to
> complete on the server and cancellation implies the work done so far is
> wasted. Instead, hedging allows you to send the first request, wait one
> second, send a second request, and accept the results of whichever
> completes first.
>
> Ok, this makes sense. However, can we make sure that if the second hedged
> request completes before the first one, we make sure that the spec expects
> to CANCEL the first call, so we can potentially free up resources? One
> unfortunate thing of working outside an environment where everything is
> Stubby, is that a lot of the time request handling holds up resources. For
> example, you establish another HTTP1.1 connection to a backend as part of
> serving your RPC. I'll augment my gRPC interceptors for Go to use this.
>
>
Definitely. As soon as one request succeeds, or fails with a fatal status
code (e.g., FAILED_PRECONDITION), the spec requires that remaining requests
be cancelled and the result sent to the client application layer. The Summary
of Retry and Hedging Logic section
<https://github.com/ncteisen/proposal/blob/9136afba3b60f414d2859524794ba4009772b38d/A6.md#summary-of-retry-and-hedging-logic>
outlines this behavior.


Thanks,

Eric


> On Monday, 27 February 2017 16:53:20 UTC, Mark D. Roth wrote:
>>
>> While talking with Craig on Friday, we realized that we need to make the
>> wire protocol a bit stricter in order to implement retries.
>>
>> Currently, the spec allows status to be sent either as part of initial
>> metadata or trailing metadata.  However, as per the When Retries are
>> Valid
>> <https://github.com/ncteisen/proposal/blob/75e08fa10405430e8177cfd91bf63a25ff4ad617/A6.md#when-retries-are-valid>
>>  section
>> of the gRFC, an RFC becomes committed when "the client receives a non-error
>> response (either an explicit OK status or any response message) from the
>> server".  This means that in a case where the server sends a retryable
>> status, if the status is not included in the initial metadata, the client
>> will consider the RPC committed as soon as it receives the initial
>> metadata, even if the only thing sent after that is the trailing metadata
>> that includes the status.  Thus, we need to require that whenever the
>> server sends status without sending any messages, the server should include
>> the status in the initial metadata (and then close the stream without
>> bothering to send trailing metadata) instead of sending both initial
>> metadata and then trailing metadata.
>>
>> Noah, can you please add a note about this to the gRFC?
>>
>> Based on a previously encounted interop problem (see
>> https://github.com/markdroth/grpc/pull/3, which was included in
>> https://github.com/grpc/grpc/pull/7201), I believe that grpc-go already
>> does the right thing here (although Saila and Menghan should confirm
>> that).  However, since that previously encountered problem did not show up
>> with Java or C++, I suspect that those stacks do not do the right thing
>> here.
>>
>> Craig has confirmed that C-core needs to be fixed in this regard, and
>> I've filed https://github.com/grpc/grpc/issues/9883 for that change.
>>
>> Eric and Penn, can you confirm that Java will need to be changed?  I'm
>> hoping that this isn't too invasive of a change, but please let us know if
>> you foresee any problems.
>>
>> Please let me know if anyone has any questions or problems with any of
>> this.  Thanks!
>>
>> On Fri, Feb 10, 2017 at 4:31 PM, ncteisen via grpc.io <
>> [email protected]> wrote:
>>
>>> I've created a gRFC describing the design and implementation plan for
>>> gRPC Retries.
>>>
>>> Take a look at the gRPC on Github
>>> <https://github.com/grpc/proposal/pull/12>.
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "grpc.io" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/grpc-io.
>>> To view this discussion on the web visit https://groups.google.com/d/ms
>>> gid/grpc-io/30e29cbc-439c-46c4-b54f-6e97637a0735%40googlegroups.com
>>> <https://groups.google.com/d/msgid/grpc-io/30e29cbc-439c-46c4-b54f-6e97637a0735%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>
>>
>> --
>> Mark D. Roth <[email protected]>
>> Software Engineer
>> Google, Inc.
>>
> --
> You received this message because you are subscribed to the Google Groups "
> grpc.io" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/grpc-io.
> To view this discussion on the web visit https://groups.google.com/d/
> msgid/grpc-io/40e2a470-0e4c-4687-b250-a00afc76f38f%40googlegroups.com
> <https://groups.google.com/d/msgid/grpc-io/40e2a470-0e4c-4687-b250-a00afc76f38f%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"grpc.io" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/grpc-io.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/grpc-io/CALUXJ7iGroobGc0fbsxYDSr_DSwpNyNsyQLqaFywwHgfFDQgyg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Re: [grpc-io] gRPC A6: Retries

Reply via email to