Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15046 )

Change subject: IMPALA-9296: Move AuxErrorInfo to StatefulStatus
......................................................................


Patch Set 1:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/15046/1/be/src/runtime/runtime-state.h
File be/src/runtime/runtime-state.h:

http://gerrit.cloudera.org:8080/#/c/15046/1/be/src/runtime/runtime-state.h@322
PS1, Line 322: This method clears aux_error_info_
Hmm, I guess I didn't realize this wasn't already the case before this patch. 
Now I'm confused as to what the issue was in IMPALA-9295, since before this 
patch it seems like once a AuxErrorInfoPB has been generated, it will be sent 
with every report.

I think this patch is still the right thing to do though, otherwise if we send 
the same AuxErrorInfoPB multiple times the coordinator will blacklist the 
corresponding executor multiple times even if there was really only one rpc 
failure.


http://gerrit.cloudera.org:8080/#/c/15046/1/be/src/runtime/runtime-state.h@446
PS1, Line 446:   bool reported_aux_error_info_;
Might want to explicitly initialize this to false here for clarity.


http://gerrit.cloudera.org:8080/#/c/15046/1/be/src/runtime/runtime-state.cc
File be/src/runtime/runtime-state.cc:

http://gerrit.cloudera.org:8080/#/c/15046/1/be/src/runtime/runtime-state.cc@316
PS1, Line 316:   if (aux_error_info_ == nullptr && !reported_aux_error_info_) {
I'm not sure that this is the right was to do it, since it means that if a 
backend sees multiple rpc failures in a single query only one will ever be 
reported to the coordinator.

Of course, I've been advocating for being aggressive about blacklisting. 
Suppose there were two rpc failures, then there are two cases here - either 
both rpcs were to the same other executor, in which case the fact that there 
were two failures makes us more confident something is going on with that 
executor and we might actually want to blacklist the executor twice (which will 
just extend the amount of time that it stays blacklisted for), or the two rpcs 
were to different executors, in which case if we only blacklist one of them if 
we then retry the query it may very well fail again.

And even if we do want to stay more conservative about blacklisting, you've 
suggested before (and I agree) that its generally preferable to report as much 
info about errors as we've got, and then centralize the logic for deciding how 
to act on those errors in the coordinator.

Obviously this is all debatable, though.



--
To view, visit http://gerrit.cloudera.org:8080/15046
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iabbb48dd3ab58ba7b76b1ab6979b92d0e25e72e3
Gerrit-Change-Number: 15046
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Comment-Date: Thu, 16 Jan 2020 20:02:11 +0000
Gerrit-HasComments: Yes

Reply via email to