[Impala-ASF-CR] IMPALA-9138: Classify certain errors as retryable

Thomas Tauber-Marshall (Code Review) Wed, 11 Dec 2019 11:51:46 -0800

Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14882 )


Change subject: IMPALA-9138: Classify certain errors as retryable
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14882/1/common/thrift/Status.thrift
File common/thrift/Status.thrift:

http://gerrit.cloudera.org:8080/#/c/14882/1/common/thrift/Status.thrift@45
PS1, Line 45:   3: optional TStatusProperties status_properties
So after reading your responses on the other review and thinking about it a 
bunch more, I'm coming around to the idea that including this info in the 
Status is the best approach.

My suggestion of doing something similar to the blacklisting patch, eg. 
including this info in the AuxErrorInfoPB or otherwise setting it as a property 
on RuntimeState, doesn't really work because there could be multiple errors 
encountered and if any of them are non-retryable we shouldn't retry it. So, if 
one was retryable and set an AuxErrorInfoPB that indicates retry-ability, when 
the other one that was non-retryable is generated we would have to override 
that info somehow, which would require something like calling 
RuntimeState::SetAuxErrorInfo() everywhere we generate any error Status (not 
realistic) or checking if the overall_status we send in the report matches the 
status that resulted in setting the AuxErrorInfoPB (hacky) or something like 
that.

So, to the extent that my concern is that we don't want to retry things that we 
shouldn't and not retry things that we should, I think that we can alleviate my 
concerns by making the errors one-sided, i.e. I think its much worse to retry 
something that shouldn't be retried rather than to not retry something that 
should be retried, so we should just structure things in a way where we only 
retry if we're sure the error is retryable, and if in the future we 
occasionally have bugs where queries that could be retried sometimes aren't, 
that should be fine (esp. since I imagine that we won't be marking a ton of 
different errors as retry-able)

This will require some changes to this patch:
- change the semantics of MergeStatus() so that if either status is 
non-retryable the merged status is non-retryable.
- change the handling of per-fragment statuses/overall_status in the exec 
status report. Currently, if multiple fragments report errors, only the first 
fragment's error is ever sent to the coordinator and the rest are discarded. 
This means that if the first fragment reports a retryable error and another 
fragment reports a non-retryable error, the query will be incorrectly retried. 
We'll need to do something like call MergeStatus() in 
QueryState::ErrorDuring(Prepare|Execute) or possibly send back each fragment's 
status in the FragmentInstanceExecStatusPB or similar



--
To view, visit http://gerrit.cloudera.org:8080/14882
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8a79b0245cb5d4349aa375e6503bcc5bec3b72a3
Gerrit-Change-Number: 14882
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Comment-Date: Wed, 11 Dec 2019 19:50:51 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9138: Classify certain errors as retryable

Reply via email to