Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/14882 )
Change subject: IMPALA-9138: Classify certain errors as retryable ...................................................................... Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/14882/1/common/thrift/Status.thrift File common/thrift/Status.thrift: http://gerrit.cloudera.org:8080/#/c/14882/1/common/thrift/Status.thrift@45 PS1, Line 45: 3: optional TStatusProperties status_properties So after reading your responses on the other review and thinking about it a bunch more, I'm coming around to the idea that including this info in the Status is the best approach. My suggestion of doing something similar to the blacklisting patch, eg. including this info in the AuxErrorInfoPB or otherwise setting it as a property on RuntimeState, doesn't really work because there could be multiple errors encountered and if any of them are non-retryable we shouldn't retry it. So, if one was retryable and set an AuxErrorInfoPB that indicates retry-ability, when the other one that was non-retryable is generated we would have to override that info somehow, which would require something like calling RuntimeState::SetAuxErrorInfo() everywhere we generate any error Status (not realistic) or checking if the overall_status we send in the report matches the status that resulted in setting the AuxErrorInfoPB (hacky) or something like that. So, to the extent that my concern is that we don't want to retry things that we shouldn't and not retry things that we should, I think that we can alleviate my concerns by making the errors one-sided, i.e. I think its much worse to retry something that shouldn't be retried rather than to not retry something that should be retried, so we should just structure things in a way where we only retry if we're sure the error is retryable, and if in the future we occasionally have bugs where queries that could be retried sometimes aren't, that should be fine (esp. since I imagine that we won't be marking a ton of different errors as retry-able) This will require some changes to this patch: - change the semantics of MergeStatus() so that if either status is non-retryable the merged status is non-retryable. - change the handling of per-fragment statuses/overall_status in the exec status report. Currently, if multiple fragments report errors, only the first fragment's error is ever sent to the coordinator and the rest are discarded. This means that if the first fragment reports a retryable error and another fragment reports a non-retryable error, the query will be incorrectly retried. We'll need to do something like call MergeStatus() in QueryState::ErrorDuring(Prepare|Execute) or possibly send back each fragment's status in the FragmentInstanceExecStatusPB or similar -- To view, visit http://gerrit.cloudera.org:8080/14882 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I8a79b0245cb5d4349aa375e6503bcc5bec3b72a3 Gerrit-Change-Number: 14882 Gerrit-PatchSet: 1 Gerrit-Owner: Sahil Takiar <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Sahil Takiar <[email protected]> Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]> Gerrit-Comment-Date: Wed, 11 Dec 2019 19:50:51 +0000 Gerrit-HasComments: Yes
