Sailesh Mukil has uploaded this change for review. ( http://gerrit.cloudera.org:8080/9635 )

Change subject: IMPALA-6662: Make stress test resilient to hangs due to client crashes
......................................................................

IMPALA-6662: Make stress test resilient to hangs due to client crashes

The concurrent_select.py process starts multiple subprocesses (called
query runners) to run the queries. It also starts 2 threads, called the
query producer thread and the query consumer thread. The query producer
thread adds queries to a query queue, and the query consumer thread
pulls queries off the queue and feeds them to the query runners.

Once a query runner gets a query, it does the following:

  ...
  with _submit_query_lock:
    increment(num_queries_started)
  run_query()  # One runner crashes here.
  increment(num_queries_finished)
  ...

If one of the runners crashes inside run_query(), it never increments
num_queries_finished.

Another thread that's supposed to check for memory leaks (but actually
doesn't) periodically acquires '_submit_query_lock' and waits for the
number of running queries to reach 0 before releasing the lock.

However, in the above case, the number of running queries will never
reach 0, because the crashed query runner exited without incrementing
'num_queries_finished'. Therefore, the poll_mem_usage() function will
hold the lock indefinitely, so no new queries can be submitted and the
stress test never completes.

This patch fixes the problem by changing the global trackers
(num_queries_started, num_queries_finished, etc.) into per-QueryRunner
counters. Any time we want to find the total number of queries
started/finished/cancelled, etc., we aggregate the values from all the
runners. We synchronize access by adding a new lock called
_query_runners_lock.

In _wait_for_test_to_finish(), we periodically check whether a
QueryRunner has died. If it has, we set its num_queries_finished to its
num_queries_started, since it may have died before updating the
'finished' value, and we also count the error.

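For reviewers who want the idea in isolation, here is a minimal sketch of
the per-runner bookkeeping described above. It is not the code in this
patch: RunnerMetrics, RunnerRegistry, the 'alive' flag and demo() are
hypothetical names, and plain in-process integers stand in for whatever
shared counters the real query runner subprocesses use. Only the counter
names and _query_runners_lock come from the description above.

```python
import threading


class RunnerMetrics(object):
    """Hypothetical stand-in for the counters kept by one query runner."""

    def __init__(self, name):
        self.name = name
        self.num_queries_started = 0
        self.num_queries_finished = 0
        self.num_errors = 0
        # Assumption: a simple flag replaces a real process liveness check.
        self.alive = True


class RunnerRegistry(object):
    """Aggregates per-runner counters under a single lock."""

    def __init__(self):
        self._query_runners_lock = threading.Lock()
        self._runners = []

    def add(self, runner):
        with self._query_runners_lock:
            self._runners.append(runner)

    def total(self, field):
        # Sum one counter (e.g. 'num_queries_started') across all runners.
        with self._query_runners_lock:
            return sum(getattr(r, field) for r in self._runners)

    def num_running(self):
        with self._query_runners_lock:
            return sum(r.num_queries_started - r.num_queries_finished
                       for r in self._runners)

    def reconcile_dead_runners(self):
        """Mark queries of dead runners as finished and count them as errors.

        Returns the names of the runners whose counters were adjusted.
        """
        adjusted = []
        with self._query_runners_lock:
            for r in self._runners:
                if r.alive:
                    continue
                missing = r.num_queries_started - r.num_queries_finished
                if missing > 0:
                    # The runner died before updating 'finished'.
                    r.num_queries_finished = r.num_queries_started
                    r.num_errors += missing
                    adjusted.append(r.name)
        return adjusted


def demo():
    """Simulate one healthy runner and one that crashed mid-query."""
    registry = RunnerRegistry()
    healthy = RunnerMetrics("runner-0")
    crashed = RunnerMetrics("runner-1")
    registry.add(healthy)
    registry.add(crashed)

    healthy.num_queries_started = 3
    healthy.num_queries_finished = 3
    crashed.num_queries_started = 2
    crashed.num_queries_finished = 1  # Died inside its second query.
    crashed.alive = False

    running_before = registry.num_running()
    adjusted = registry.reconcile_dead_runners()
    running_after = registry.num_running()
    return running_before, running_after, adjusted


if __name__ == "__main__":
    print(demo())
```

In this sketch, a poller that waits for num_running() to reach 0 can make
progress once reconcile_dead_runners() has been called, which is the
property the hang was missing.
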
Also, reformatted small bits of the code and added more comments and
debug logs.

Testing: Ran the stress test with the new patch a few times to make sure
that it's doing what we expect.

TODO: Change the way we obtain query_idx. Also reformat other parts of
code so that it's easier to make future changes.

Change-Id: I10c5dc9b8c2fffc471bac2279e348bc1d1fec3b7
---
M tests/stress/concurrent_select.py
1 file changed, 236 insertions(+), 82 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/35/9635/1
--
To view, visit http://gerrit.cloudera.org:8080/9635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I10c5dc9b8c2fffc471bac2279e348bc1d1fec3b7
Gerrit-Change-Number: 9635
Gerrit-PatchSet: 1
Gerrit-Owner: Sailesh Mukil <sail...@cloudera.com>