Michael Brown has posted comments on this change. ( http://gerrit.cloudera.org:8080/9635 )
Change subject: IMPALA-6662: Make stress test resilient to hangs due to client crashes ...................................................................... Patch Set 2: (14 comments) First pass, will let you respond and do another pass tomorrow. http://gerrit.cloudera.org:8080/#/c/9635/2/tests/stress/concurrent_select.py File tests/stress/concurrent_select.py: http://gerrit.cloudera.org:8080/#/c/9635/2/tests/stress/concurrent_select.py@a268 PS2, Line 268: Nit: Go ahead and leave extra blank line this in. flake8 prefers 2 spaces between classes. http://gerrit.cloudera.org:8080/#/c/9635/2/tests/stress/concurrent_select.py@321 PS2, Line 321: acces Nit: access http://gerrit.cloudera.org:8080/#/c/9635/2/tests/stress/concurrent_select.py@339 PS2, Line 339: _record_runner_metrics_before_kill Can we have a more descriptive name? This happens after a process failed, before you evict the process from self._query_runners. Do you think better names would be something like _record_runner_metrics_before_del or _record_runner_metrics_before_evict ? http://gerrit.cloudera.org:8080/#/c/9635/2/tests/stress/concurrent_select.py@368 PS2, Line 368: @property Any reason you use the property decorator here but nowhere else? http://gerrit.cloudera.org:8080/#/c/9635/2/tests/stress/concurrent_select.py@376 PS2, Line 376: num_running = total_started - total_finished Maybe a nit, but can this exist outside the lock? That is, indented two spaces to the left. http://gerrit.cloudera.org:8080/#/c/9635/2/tests/stress/concurrent_select.py@382 PS2, Line 382: total_dequeued = self._past_runners_num_queries_dequeued : for runner in self._query_runners: : total_dequeued += runner._num_queries_dequeued.value : return total_dequeued return self._total_num_queries_dequed_no_lock() http://gerrit.cloudera.org:8080/#/c/9635/2/tests/stress/concurrent_select.py@387 PS2, Line 387: def _total_num_queries_dequeued_no_lock(self): : """ TODO: Get rid of this function after reformatting how we obtain query indices. : """ I was confused about this function being dangerous until I saw that in its usage L667, you take _query_runners_lock before calling it. Can you add a comment that _query_runners_lock MUST be taken before calling this? http://gerrit.cloudera.org:8080/#/c/9635/2/tests/stress/concurrent_select.py@647 PS2, Line 647: impalad Cool, this param isn't needed anymore since you bind the query_runner to an impalad beforehand. http://gerrit.cloudera.org:8080/#/c/9635/2/tests/stress/concurrent_select.py@850 PS2, Line 850: if runner.proc.exitcode != 0: While I believe queries that fail with error fail with some low positive integer (probably 1), can we inspect this for more interesting values and print some diagnostic info to the console? This will help show that we are still hitting SIGSEGV for example and getting past it. http://gerrit.cloudera.org:8080/#/c/9635/2/tests/stress/concurrent_select.py@860 PS2, Line 860: assert runner._num_queries_started.value - \ : runner._num_queries_finished.value == 1 I think this assert can be invalid if more than one runner has crashed. Let's say two crash at roughly the same time and then you enter this loop. This assertion will break. http://gerrit.cloudera.org:8080/#/c/9635/2/tests/stress/concurrent_select.py@867 PS2, Line 867: assert runner._num_queries_started.value == runner._num_queries_finished.value Same with this. I thought about this as a solution, but it also has a flaw: 1. Take the value of self._num_queries_started.value - self._num.queries_finished.value 2. Count the number of QueryRunner() objects whose proc.exitcode != 0 3. Take value from 1 - value from 2. Assert this value is >= 0. 4. This difference is the amount by which you need to increment self._num.queries_finished.value. The problem is, you can't distinguish which QueryRunner() objects must be deleted from self._query_runners. In any case, I'll leave that comment here and let you ponder it. http://gerrit.cloudera.org:8080/#/c/9635/2/tests/stress/concurrent_select.py@877 PS2, Line 877: == True: Remove. For Booleans, you can just use "if var". PEP-008 says to do it this way. http://gerrit.cloudera.org:8080/#/c/9635/2/tests/stress/concurrent_select.py@1078 PS2, Line 1078: def _update_from_query_report Python nit: This is treated as a public method, so it shouldn't have the _ prefixed in the name. I concede that there are likely violations of this already in other parts of the code. http://gerrit.cloudera.org:8080/#/c/9635/2/tests/stress/concurrent_select.py@1079 PS2, Line 1079: LOG.debug("Updating runtime stats") Is there a simple way to distinguish this message across different QueryRunner() objects? Even id(self) is probably ok. -- To view, visit http://gerrit.cloudera.org:8080/9635 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I10c5dc9b8c2fffc471bac2279e348bc1d1fec3b7 Gerrit-Change-Number: 9635 Gerrit-PatchSet: 2 Gerrit-Owner: Sailesh Mukil <[email protected]> Gerrit-Reviewer: David Knupp <[email protected]> Gerrit-Reviewer: Michael Brown <[email protected]> Gerrit-Comment-Date: Thu, 15 Mar 2018 01:36:23 +0000 Gerrit-HasComments: Yes
