Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/15046 )
Change subject: IMPALA-9296: Move AuxErrorInfo to StatefulStatus ...................................................................... Patch Set 1: (3 comments) http://gerrit.cloudera.org:8080/#/c/15046/1/be/src/runtime/runtime-state.h File be/src/runtime/runtime-state.h: http://gerrit.cloudera.org:8080/#/c/15046/1/be/src/runtime/runtime-state.h@322 PS1, Line 322: This method clears aux_error_info_ Hmm, I guess I didn't realize this wasn't already the case before this patch. Now I'm confused as to what the issue was in IMPALA-9295, since before this patch it seems like once a AuxErrorInfoPB has been generated, it will be sent with every report. I think this patch is still the right thing to do though, otherwise if we send the same AuxErrorInfoPB multiple times the coordinator will blacklist the corresponding executor multiple times even if there was really only one rpc failure. http://gerrit.cloudera.org:8080/#/c/15046/1/be/src/runtime/runtime-state.h@446 PS1, Line 446: bool reported_aux_error_info_; Might want to explicitly initialize this to false here for clarity. http://gerrit.cloudera.org:8080/#/c/15046/1/be/src/runtime/runtime-state.cc File be/src/runtime/runtime-state.cc: http://gerrit.cloudera.org:8080/#/c/15046/1/be/src/runtime/runtime-state.cc@316 PS1, Line 316: if (aux_error_info_ == nullptr && !reported_aux_error_info_) { I'm not sure that this is the right was to do it, since it means that if a backend sees multiple rpc failures in a single query only one will ever be reported to the coordinator. Of course, I've been advocating for being aggressive about blacklisting. Suppose there were two rpc failures, then there are two cases here - either both rpcs were to the same other executor, in which case the fact that there were two failures makes us more confident something is going on with that executor and we might actually want to blacklist the executor twice (which will just extend the amount of time that it stays blacklisted for), or the two rpcs were to different executors, in which case if we only blacklist one of them if we then retry the query it may very well fail again. And even if we do want to stay more conservative about blacklisting, you've suggested before (and I agree) that its generally preferable to report as much info about errors as we've got, and then centralize the logic for deciding how to act on those errors in the coordinator. Obviously this is all debatable, though. -- To view, visit http://gerrit.cloudera.org:8080/15046 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iabbb48dd3ab58ba7b76b1ab6979b92d0e25e72e3 Gerrit-Change-Number: 15046 Gerrit-PatchSet: 1 Gerrit-Owner: Sahil Takiar <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]> Gerrit-Comment-Date: Thu, 16 Jan 2020 20:02:11 +0000 Gerrit-HasComments: Yes
