[Impala-ASF-CR] IMPALA-6662: Make stress test resilient to hangs due to client crashes

Sailesh Mukil (Code Review) Thu, 15 Mar 2018 14:55:04 -0700

Hello Lars Volker, Michael Brown, David Knupp,

I'd like you to reexamine a change. Please visit


    http://gerrit.cloudera.org:8080/9635

to look at the new patch set (#5).

Change subject: IMPALA-6662: Make stress test resilient to hangs due to client 
crashes
......................................................................

IMPALA-6662: Make stress test resilient to hangs due to client crashes

The concurrent_select.py process starts multiple subprocesses (called
query runners) to run the queries. It also starts two threads: the
query producer thread and the query consumer thread. The query
producer thread adds queries to a query queue, and the query consumer
thread pulls them off the queue and feeds them to the query runners.

The query runner, once it gets queries, does the following:
...
  with _submit_query_lock:
    increment(num_queries_started)
  run_query()    # One runner crashes here.
  increment(num_queries_finished)
...

One of the runners crashes inside run_query() and therefore never
increments num_queries_finished.

Another thread, which is supposed to check for memory leaks (but
actually doesn't), periodically acquires '_submit_query_lock' and
waits for the number of running queries to reach 0 before releasing
the lock.

However, in the above case, the number of running queries will never
reach 0 because one of the query runners exited without incrementing
'num_queries_finished'. Therefore, the poll_mem_usage() function holds
the lock indefinitely, so no new queries are submitted and the stress
test never completes.
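
The hang described above can be reproduced in miniature. The sketch
below is only an illustration, not the code in concurrent_select.py:
the dict of counters, run_query_runner() and wait_for_idle() are
hypothetical names, threads stand in for the subprocesses, and a
timeout is added purely so that the demonstration terminates (the
original loop has no such timeout).

```python
import threading
import time

_submit_query_lock = threading.Lock()
# Hypothetical stand-in for the global counters described above.
counters = {"started": 0, "finished": 0}


def run_query_runner(crash):
    """Mimic one iteration of a query runner."""
    with _submit_query_lock:
        counters["started"] += 1
    if crash:
        # Simulates the runner dying inside run_query(): the
        # 'finished' counter is never incremented.
        return
    counters["finished"] += 1


def wait_for_idle(timeout_s):
    """Hold the lock until no queries are running, or until timeout_s.

    Returns True if the running count reached 0, False on timeout.
    The timeout exists only for this demo; without it the loop would
    spin forever after a crash while holding the lock.
    """
    deadline = time.monotonic() + timeout_s
    with _submit_query_lock:
        while counters["started"] - counters["finished"] > 0:
            if time.monotonic() >= deadline:
                return False
            time.sleep(0.01)
    return True
```

After a normal run, wait_for_idle() returns promptly. After a single
simulated crash it can only time out, which corresponds to the
indefinite hang in the real test.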

This patch fixes the problem by moving the global counters
(num_queries_started, num_queries_finished, etc.) into each
QueryRunner. Whenever we need the total number of queries started,
finished, cancelled, etc., we aggregate the values from all the
runners. Access is synchronized by a new lock, _query_runners_lock.
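
A minimal sketch of the per-runner bookkeeping, under stated
assumptions: RunnerCounters and CounterAggregator are hypothetical
names, and plain integers are used for brevity. Because the real query
runners are subprocesses, the actual counters would need to live in
shared memory; this only illustrates aggregating under
_query_runners_lock.

```python
import threading
from dataclasses import dataclass


@dataclass
class RunnerCounters:
    # Hypothetical stand-in for the counters kept per QueryRunner.
    num_queries_started: int = 0
    num_queries_finished: int = 0


class CounterAggregator:
    def __init__(self, num_runners):
        self._query_runners_lock = threading.Lock()
        self._query_runners = [RunnerCounters() for _ in range(num_runners)]

    def record(self, runner_idx, field):
        """Increment one counter on one runner."""
        with self._query_runners_lock:
            runner = self._query_runners[runner_idx]
            setattr(runner, field, getattr(runner, field) + 1)

    def total(self, field):
        """Sum one counter across all runners."""
        with self._query_runners_lock:
            return sum(getattr(r, field) for r in self._query_runners)

    def num_running(self):
        """Queries started but not yet finished, across all runners."""
        with self._query_runners_lock:
            return sum(r.num_queries_started - r.num_queries_finished
                       for r in self._query_runners)
```

Keeping the counters per runner is what makes it possible to tell
which runner left a query unaccounted for, which a single global
counter cannot do.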

In _wait_for_test_to_finish(), we periodically check whether a
QueryRunner has died. If it has, we set its num_queries_finished to its
num_queries_started, since it may have died before updating the
'finished' value, and we also count the error.
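
The reconciliation step might look roughly like the sketch below.
RunnerState, its 'alive' flag, the num_errors field and
reconcile_if_dead() are all hypothetical names chosen for illustration,
and counting one error per unaccounted query is an assumption; the
patch itself may record the error differently.

```python
from dataclasses import dataclass


@dataclass
class RunnerState:
    # Hypothetical view of one QueryRunner as seen by the parent.
    alive: bool
    num_queries_started: int = 0
    num_queries_finished: int = 0
    num_errors: int = 0  # assumed error counter, not from the patch


def reconcile_if_dead(runner):
    """Fix up the counters of a dead runner.

    Returns True if the runner is dead, False if it is still alive.
    Calling it repeatedly on the same dead runner is harmless: once
    finished == started, nothing more is counted.
    """
    if runner.alive:
        return False
    unfinished = runner.num_queries_started - runner.num_queries_finished
    if unfinished > 0:
        # The runner died before updating 'finished'; treat the
        # in-flight query as finished with an error.
        runner.num_queries_finished = runner.num_queries_started
        runner.num_errors += unfinished
    return True
```

With the counters reconciled, the number of running queries can reach
0 again even after a runner has crashed.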

Also, reformatted small bits of the code and added more comments and debug
logs.

Testing: Ran the stress test with the new patch a few times to make sure
that it's doing what we expect.

TODO: Change the way we obtain query_idx. Also reformat other parts of code
so that it's easier to make future changes.

Change-Id: I10c5dc9b8c2fffc471bac2279e348bc1d1fec3b7
---
M tests/stress/concurrent_select.py
1 file changed, 236 insertions(+), 81 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/35/9635/5
--
To view, visit http://gerrit.cloudera.org:8080/9635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I10c5dc9b8c2fffc471bac2279e348bc1d1fec3b7
Gerrit-Change-Number: 9635
Gerrit-PatchSet: 5
Gerrit-Owner: Sailesh Mukil <[email protected]>
Gerrit-Reviewer: David Knupp <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Michael Brown <[email protected]>
Gerrit-Reviewer: Sailesh Mukil <[email protected]>
