Impala Public Jenkins has submitted this change and it was merged. (
http://gerrit.cloudera.org:8080/12521 )
Change subject: IMPALA-6662: Make stress test resilient to hangs due to client
crashes
......................................................................
IMPALA-6662: Make stress test resilient to hangs due to client crashes
Thanks to Sailesh Mukil for the initial version of this patch.
The concurrent_select.py process starts multiple subprocesses
(called query runners) to run the queries. It also starts two threads:
the query producer thread and the query consumer thread. The
query producer thread adds queries to a query queue, and the query
consumer thread pulls them off the queue and feeds them to the
query runners.
The query runner, once it gets queries, does the following:
  ...
  with _submit_query_lock:
    increment(num_queries_started)
  run_query()    # One runner crashes here.
  increment(num_queries_finished)
  ...
If one of the runners crashes inside run_query(), it never increments
num_queries_finished.
Another thread, which is supposed to check for memory leaks
(but actually doesn't), periodically acquires '_submit_query_lock' and
waits for the number of running queries to reach 0 before releasing the
lock.
However, in the above case, the number of running queries will never
reach 0 because one of the query runners exited without incrementing
'num_queries_finished'. Therefore, the poll_mem_usage() function holds
the lock indefinitely, so no new queries are submitted and the stress
test never completes.
This patch fixes the problem by tracking num_queries_started,
num_queries_finished, etc. per QueryRunner instead of globally.
Whenever we need the total number of queries started/finished/
cancelled, etc., we aggregate the values from all the runners. Access
is synchronized by a new lock called _query_runners_lock.
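
The per-runner bookkeeping described above could look roughly like the
following. This is only a sketch under stated assumptions: RunnerStats
and total() are hypothetical names, not taken from concurrent_select.py,
and plain integers with a threading lock are used for brevity, whereas
the real runners are subprocesses and would need shared counters.

```python
import threading


class RunnerStats(object):
    """Hypothetical per-runner counters; not the real QueryRunner class."""

    def __init__(self):
        self.num_queries_started = 0
        self.num_queries_finished = 0


# Guards the list of runners, mirroring the role of _query_runners_lock.
_query_runners_lock = threading.Lock()
_query_runners = [RunnerStats(), RunnerStats()]


def total(metric_name):
    """Sum one metric across all runners while holding the lock."""
    with _query_runners_lock:
        return sum(getattr(r, metric_name) for r in _query_runners)


# Example: runner 0 finished all 3 queries, runner 1 has one in flight.
_query_runners[0].num_queries_started = 3
_query_runners[0].num_queries_finished = 3
_query_runners[1].num_queries_started = 2
_query_runners[1].num_queries_finished = 1
running = total("num_queries_started") - total("num_queries_finished")
```

Because each runner owns its counters, a crash in one runner leaves a
discrepancy that can be attributed to that runner alone.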
In _wait_for_test_to_finish(), we periodically check whether a
QueryRunner has died. If it has, we set its num_queries_finished to its
num_queries_started, since it may have died before updating the
'finished' value, and we also count the error.
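
The dead-runner check might be sketched as below. FakeRunner,
reconcile_dead_runners(), the 'alive' and 'handled' flags, and
num_errors are all hypothetical names chosen for illustration; the
'handled' flag is an added assumption so that repeated polling counts
each dead runner only once.

```python
class FakeRunner(object):
    """Hypothetical stand-in for a QueryRunner and its subprocess."""

    def __init__(self, started, finished, alive):
        self.num_queries_started = started
        self.num_queries_finished = finished
        self.alive = alive
        self.num_errors = 0
        self.handled = False  # Assumption: avoids double-counting a death.


def reconcile_dead_runners(runners):
    """Fix up counters of runners that died; return how many were fixed."""
    fixed = 0
    for runner in runners:
        if runner.alive or runner.handled:
            continue
        # The runner may have died before bumping 'finished', so treat
        # every query it started as finished and record one error.
        runner.num_queries_finished = runner.num_queries_started
        runner.num_errors += 1
        runner.handled = True
        fixed += 1
    return fixed
```

With counters reconciled this way, started minus finished can reach 0
again even after a crash, so a waiter on that condition is not blocked
forever.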
Other changes:
* Boilerplate code is reduced by storing all metrics in a dictionary
keyed by the metric name, instead of stamping out the code for
10+ variables.
* Added more comments and debug strings.
* Reformatted some code.
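
As an illustration of the first bullet, metrics keyed by name could be
stored as follows. This is a sketch only: the metric names and the
make_metrics() and increment() helpers are assumptions, not the
identifiers used in the patch.

```python
from multiprocessing import Value

# Hypothetical metric names; the real script tracks more than these.
METRIC_NAMES = ["queries_started", "queries_finished", "queries_cancelled"]


def make_metrics():
    """Create one shared integer counter per metric name."""
    return {name: Value("i", 0) for name in METRIC_NAMES}


def increment(counter):
    """Atomically add one to a multiprocessing.Value counter."""
    with counter.get_lock():
        counter.value += 1


metrics = make_metrics()
increment(metrics["queries_started"])
```

Adding a new metric then only requires a new entry in the list of
names rather than another hand-written variable and accessor.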
Testing:
Ran the stress test with the new patch locally and against a cluster.
Change-Id: I525bf13e0f3dd660c0d9f5c2bf6eb292e7ebb8af
Reviewed-on: http://gerrit.cloudera.org:8080/12521
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
---
M tests/stress/concurrent_select.py
1 file changed, 222 insertions(+), 115 deletions(-)
Approvals:
Impala Public Jenkins: Looks good to me, approved; Verified
--
To view, visit http://gerrit.cloudera.org:8080/12521
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I525bf13e0f3dd660c0d9f5c2bf6eb292e7ebb8af
Gerrit-Change-Number: 12521
Gerrit-PatchSet: 8
Gerrit-Owner: Tim Armstrong <[email protected]>
Gerrit-Reviewer: David Knupp <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Thomas Marshall <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>