[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

Michael Ho (Code Review) Tue, 05 Feb 2019 15:12:26 -0800

Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12049 )


Change subject: IMPALA-4555: Make QueryState's status reporting more robust
......................................................................


Patch Set 5: Code-Review+2

(7 comments)

LGTM. Please address the remaining comments.

http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/fragment-instance-state.cc
File be/src/runtime/fragment-instance-state.cc:

http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/fragment-instance-state.cc@259
PS5, Line 259:     instance_status->set_report_seq_no(report_seq_no_);
Can you please a comment on why we don't advance the seq number here to help 
readers understand it ?


http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/fragment-instance-state.cc@300
PS5, Line 300: on in
typo ?


http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/query-state.h
File be/src/runtime/query-state.h:

http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/query-state.h@421
PS5, Line 421:   /// Returns the amount of time to wait before sending the next 
status report, calculated
in milliseconds


http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/query-state.h@422
PS5, Line 422:  exponential
This isn't correct, right ?


http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/query-state.h@424
PS5, Line 424: int64_t GetReportWaitTimeMs();
int64_t GetReportWaitTimeMs() const;


http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/12049/5/be/src/runtime/query-state.cc@415
PS5, Line 415: FLAGS_status_report_interval_ms
A minor quirk I just noticed. If the user sets FLAGS_status_report_interval_ms 
to 0, we may want to have a minimum sleep time. Otherwise, if we failed to send 
the last report, we may retry without sleeping.

I guess my earlier suggestion to use the same timeout may have led to this 
quirk but the current timeout logic still seems simpler overall.


http://gerrit.cloudera.org:8080/#/c/12049/5/tests/custom_cluster/test_rpc_timeout.py
File tests/custom_cluster/test_rpc_timeout.py:

http://gerrit.cloudera.org:8080/#/c/12049/5/tests/custom_cluster/test_rpc_timeout.py@158
PS5, Line 158:     # Since the sleep time (1000ms) is much longer than the rpc 
timeout (100ms), all
             :     # reports will appear to fail. The query is designed to 
result in many intermediate
             :     # status reports but fewer than the max allowed failures, so 
the query should succeed.
             :     query_options = {'debug_action': 
'REPORT_EXEC_STATUS_DELAY:SLEEP@1000'}
This is a good test. Can you please add a test which exercises the stateful 
reporting part by running some UDFs which may generate errors 
probabilistically. (e.g. as a scan predicate which calls the UDF). In this 
case, we will also start accumulating errors in the fragment instance state.



--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 5
Gerrit-Owner: Thomas Marshall <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Thomas Marshall <[email protected]>
Gerrit-Comment-Date: Tue, 05 Feb 2019 23:11:51 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4555: Make QueryState's status reporting more robust

Reply via email to