Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12049 )
Change subject: IMPALA-4555: Make QueryState's status reporting more robust ...................................................................... Patch Set 5: Code-Review+2 (7 comments) LGTM. Please address the remaining comments. http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/fragment-instance-state.cc File be/src/runtime/fragment-instance-state.cc: http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/fragment-instance-state.cc@259 PS5, Line 259: instance_status->set_report_seq_no(report_seq_no_); Can you please a comment on why we don't advance the seq number here to help readers understand it ? http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/fragment-instance-state.cc@300 PS5, Line 300: on in typo ? http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/query-state.h File be/src/runtime/query-state.h: http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/query-state.h@421 PS5, Line 421: /// Returns the amount of time to wait before sending the next status report, calculated in milliseconds http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/query-state.h@422 PS5, Line 422: exponential This isn't correct, right ? http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/query-state.h@424 PS5, Line 424: int64_t GetReportWaitTimeMs(); int64_t GetReportWaitTimeMs() const; http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/query-state.cc File be/src/runtime/query-state.cc: http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/query-state.cc@415 PS5, Line 415: FLAGS_status_report_interval_ms A minor quirk I just noticed. If the user sets FLAGS_status_report_interval_ms to 0, we may want to have a minimum sleep time. Otherwise, if we failed to send the last report, we may retry without sleeping. I guess my earlier suggestion to use the same timeout may have led to this quirk but the current timeout logic still seems simpler overall. http://gerrit.cloudera.org:8080/#/c/12049/5/tests/custom_cluster/test_rpc_timeout.py File tests/custom_cluster/test_rpc_timeout.py: http://gerrit.cloudera.org:8080/#/c/12049/5/tests/custom_cluster/test_rpc_timeout.py@158 PS5, Line 158: # Since the sleep time (1000ms) is much longer than the rpc timeout (100ms), all : # reports will appear to fail. The query is designed to result in many intermediate : # status reports but fewer than the max allowed failures, so the query should succeed. : query_options = {'debug_action': 'REPORT_EXEC_STATUS_DELAY:SLEEP@1000'} This is a good test. Can you please add a test which exercises the stateful reporting part by running some UDFs which may generate errors probabilistically. (e.g. as a scan predicate which calls the UDF). In this case, we will also start accumulating errors in the fragment instance state. -- To view, visit http://gerrit.cloudera.org:8080/12049 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971 Gerrit-Change-Number: 12049 Gerrit-PatchSet: 5 Gerrit-Owner: Thomas Marshall <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Michael Ho <[email protected]> Gerrit-Reviewer: Thomas Marshall <[email protected]> Gerrit-Comment-Date: Tue, 05 Feb 2019 23:11:51 +0000 Gerrit-HasComments: Yes
