Jean-Daniel Cryans has posted comments on this change.
Change subject: KUDU-1542: in some cases apply will hang.
Patch Set 4:
> (3 comments)
> Let me make sure I understand what's happening here.
> The problem: with AUTO_FLUSH_SYNC, apply() can loop forever if each
> retry throws a PleaseThrottleException. It's not quite looping
> forever though; it's just that there's nothing stopping this
> operation from hitting "bad luck" in every call to apply() and
> getting repeatedly starved by other operations. Is that right?
> If that's accurate, I don't think this is the right approach to
> fixing the problem. Instead, we should prioritize this operation
> (the one that saw a PTE) such that after it waits on the PTE's
> deferred and retries, it cannot be starved by any operation that
> comes afterward.
The intent of KuduSession is that it's not thread-safe, so if you're waiting on
a Deferred, when it clears you should be the only one applying.
> I'm also not sure why we'd want to place an upper bound on the
> number of retry attempts. Ultimately, users only care about
> timeouts and deadlines, as they correspond to SLAs. Put another
> way, an upper bound on retries doesn't correspond to an upper bound
> on wall clock time. Calculating a deadline using getTimeoutMillis()
> when entering apply() and using that as a determinant for when to
> break out of the apply() loop seems better, though avoiding
> starvation altogether using the above is probably ideal.
I agree we should interrogate the RPC to see if it has timed out, that what I
proposed yesterday too. There's no need to add a new "max retries" concept on
top of what we already have.
The only caveat is that we don't always track timeouts on individual RPCs:
* Timeouts are handled differently depending on the flush mode.
* With AUTO_FLUSH_SYNC, the timeout is set on each apply()'d operation.
* With AUTO_FLUSH_BACKGROUND and MANUAL_FLUSH, the timeout is assigned to a
whole batch of
* operations upon flush()'ing. It means that in a situation with a timeout of
500ms and a flush
* interval of 1000ms, an operation can be outstanding for up to 1500ms before
being timed out.
To view, visit http://gerrit.cloudera.org:8080/3787
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Owner: song bruce zhang <zsyuyizh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jdcry...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: song bruce zhang <zsyuyizh...@gmail.com>