Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12049 )
Change subject: IMPALA-4555: Make QueryState's status reporting more robust ...................................................................... Patch Set 1: (5 comments) http://gerrit.cloudera.org:8080/#/c/12049/1/be/src/runtime/coordinator-backend-state.cc File be/src/runtime/coordinator-backend-state.cc: http://gerrit.cloudera.org:8080/#/c/12049/1/be/src/runtime/coordinator-backend-state.cc@299 PS1, Line 299: report_seq_no <= instance_stats->last_report_seq_no_ || instance_stats->done_) { Wouldn't this be a problem for intermediate report ? If the coordinator is slow to respond to the RPC, the client side (i.e. the executor) may assume that the intermediate reports were never applied even if it was actually applied by the coordinator. In which case, the new report may come with a larger sequence number which is larger than the old one and that may cause duplicated report. May be it makes sense to keep track of whether a FIS' report was successfully applied at the executor side and bump the sequence number if the last one was applied successfully. Not sure if there is any corner case which may not have been considered in this alternate approach ?! http://gerrit.cloudera.org:8080/#/c/12049/1/be/src/runtime/fragment-instance-state.h File be/src/runtime/fragment-instance-state.h: http://gerrit.cloudera.org:8080/#/c/12049/1/be/src/runtime/fragment-instance-state.h@99 PS1, Line 99: void SetFinalReportSent() { final_report_sent_ = true; } May help to add comments on the expected caller and on when this is expected to be called ? http://gerrit.cloudera.org:8080/#/c/12049/1/be/src/runtime/query-state.h File be/src/runtime/query-state.h: http://gerrit.cloudera.org:8080/#/c/12049/1/be/src/runtime/query-state.h@417 PS1, Line 417: (bool final Comment on the input parameter. http://gerrit.cloudera.org:8080/#/c/12049/1/be/src/runtime/query-state.cc File be/src/runtime/query-state.cc: http://gerrit.cloudera.org:8080/#/c/12049/1/be/src/runtime/query-state.cc@58 PS1, Line 58: DEFINE_int32(status_report_max_failures, 3, : "Max number of consecutive failed status reports to allow before cancelling"); I thought we want to use a fixed timeout approach for the maximum retries ? I am okay with max_retries too but I think it may make sense to have larger number of retries than 3. http://gerrit.cloudera.org:8080/#/c/12049/1/be/src/runtime/query-state.cc@324 PS1, Line 324: Try to send the RPC 3 times before failing. Sleep for 100ms between retries. May need to be updated. -- To view, visit http://gerrit.cloudera.org:8080/12049 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971 Gerrit-Change-Number: 12049 Gerrit-PatchSet: 1 Gerrit-Owner: Thomas Marshall <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Michael Ho <[email protected]> Gerrit-Comment-Date: Mon, 10 Dec 2018 21:56:58 +0000 Gerrit-HasComments: Yes
