[Impala-ASF-CR] IMPALA-6662: Make stress test resilient to hangs due to client crashes

Michael Brown (Code Review) Wed, 14 Mar 2018 18:37:27 -0700

Michael Brown has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9635 )


Change subject: IMPALA-6662: Make stress test resilient to hangs due to client 
crashes
......................................................................


Patch Set 2:

(14 comments)

First pass, will let you respond and do another pass tomorrow.

http://gerrit.cloudera.org:8080/#/c/9635/2/tests/stress/concurrent_select.py
File tests/stress/concurrent_select.py:

http://gerrit.cloudera.org:8080/#/c/9635/2/tests/stress/concurrent_select.py@a268
PS2, Line 268:
Nit: Go ahead and leave extra blank line this in. flake8 prefers 2 spaces 
between classes.


http://gerrit.cloudera.org:8080/#/c/9635/2/tests/stress/concurrent_select.py@321
PS2, Line 321: acces
Nit: access


http://gerrit.cloudera.org:8080/#/c/9635/2/tests/stress/concurrent_select.py@339
PS2, Line 339: _record_runner_metrics_before_kill
Can we have a more descriptive name? This happens after a process failed, 
before you evict the process from self._query_runners. Do you think better 
names would be something like _record_runner_metrics_before_del or 
_record_runner_metrics_before_evict ?


http://gerrit.cloudera.org:8080/#/c/9635/2/tests/stress/concurrent_select.py@368
PS2, Line 368: @property
Any reason you use the property decorator here but nowhere else?


http://gerrit.cloudera.org:8080/#/c/9635/2/tests/stress/concurrent_select.py@376
PS2, Line 376: num_running = total_started - total_finished
Maybe a nit, but can this exist outside the lock? That is, indented two spaces 
to the left.


http://gerrit.cloudera.org:8080/#/c/9635/2/tests/stress/concurrent_select.py@382
PS2, Line 382:       total_dequeued = self._past_runners_num_queries_dequeued
             :       for runner in self._query_runners:
             :         total_dequeued += runner._num_queries_dequeued.value
             :     return total_dequeued
return self._total_num_queries_dequed_no_lock()


http://gerrit.cloudera.org:8080/#/c/9635/2/tests/stress/concurrent_select.py@387
PS2, Line 387:   def _total_num_queries_dequeued_no_lock(self):
             :     """ TODO: Get rid of this function after reformatting how we 
obtain query indices.
             :     """
I was confused about this function being dangerous until I saw that in its 
usage L667, you take _query_runners_lock before calling it. Can you add a 
comment that  _query_runners_lock MUST be taken before calling this?


http://gerrit.cloudera.org:8080/#/c/9635/2/tests/stress/concurrent_select.py@647
PS2, Line 647: impalad
Cool, this param isn't needed anymore since you bind the query_runner to an 
impalad beforehand.


http://gerrit.cloudera.org:8080/#/c/9635/2/tests/stress/concurrent_select.py@850
PS2, Line 850:             if runner.proc.exitcode != 0:
While I believe queries that fail with error fail with some low positive 
integer (probably 1), can we inspect this for more interesting values and print 
some diagnostic info to the console? This will help show that we are still 
hitting SIGSEGV for example and getting past it.


http://gerrit.cloudera.org:8080/#/c/9635/2/tests/stress/concurrent_select.py@860
PS2, Line 860:                 assert runner._num_queries_started.value - \
             :                     runner._num_queries_finished.value == 1
I think this assert can be invalid if more than one runner has crashed. Let's 
say two crash at roughly the same time and then you enter this loop. This 
assertion will break.


http://gerrit.cloudera.org:8080/#/c/9635/2/tests/stress/concurrent_select.py@867
PS2, Line 867:             assert runner._num_queries_started.value == 
runner._num_queries_finished.value
Same with this.

I thought about this as a solution, but it also has a flaw:

1. Take the value of self._num_queries_started.value - 
self._num.queries_finished.value
2. Count the number of QueryRunner() objects whose proc.exitcode != 0
3. Take value from 1 - value from 2. Assert this value is >= 0.
4. This difference is the amount by which you need to increment 
self._num.queries_finished.value.

The problem is, you can't distinguish which QueryRunner() objects must be 
deleted from self._query_runners.

In any case, I'll leave that comment here and let you ponder it.


http://gerrit.cloudera.org:8080/#/c/9635/2/tests/stress/concurrent_select.py@877
PS2, Line 877:  == True:
Remove. For Booleans, you can just use "if var". PEP-008 says to do it this way.


http://gerrit.cloudera.org:8080/#/c/9635/2/tests/stress/concurrent_select.py@1078
PS2, Line 1078:   def _update_from_query_report
Python nit: This is treated as a public method, so it shouldn't have the _ 
prefixed in the name. I concede that there are likely violations of this 
already in other parts of the code.


http://gerrit.cloudera.org:8080/#/c/9635/2/tests/stress/concurrent_select.py@1079
PS2, Line 1079:     LOG.debug("Updating runtime stats")
Is there a simple way to distinguish this message across different 
QueryRunner() objects? Even id(self) is probably ok.



--
To view, visit http://gerrit.cloudera.org:8080/9635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I10c5dc9b8c2fffc471bac2279e348bc1d1fec3b7
Gerrit-Change-Number: 9635
Gerrit-PatchSet: 2
Gerrit-Owner: Sailesh Mukil <[email protected]>
Gerrit-Reviewer: David Knupp <[email protected]>
Gerrit-Reviewer: Michael Brown <[email protected]>
Gerrit-Comment-Date: Thu, 15 Mar 2018 01:36:23 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6662: Make stress test resilient to hangs due to client crashes

Reply via email to