Sailesh Mukil has posted comments on this change. ( http://gerrit.cloudera.org:8080/9635 )
Change subject: IMPALA-6662: Make stress test resilient to hangs due to client crashes ...................................................................... Patch Set 4: (6 comments) > Patch Set 4: > > (6 comments) > > I had a first look and left some comments but I think the approach needs a > more thorough overhaul. It is not clear to me how queries and query-runners > relate to each other and the data structures involved in starting new runners > seem opaque to me. I think it would be easier to understand if we had a > proper multiprocessing.Queue() with work items (each representing a query) > and a pool of workers that consume items from the queue and put results into > a result queue, which then gets processed by the main process. Then the whole > bug would come down to an item being removed from the work queue but no > corresponding item showing up in the result queue. It would also simplify the > locking and computing sums of counters. > > I can see how that could be out of scope for this change. Let me know if > you'd move forward with the current change as is and I'll do another pass. I completely agree that we need to rethink the design of the stress test, however, I've only changed the parts of this patch that would help us unblock stress testing. There are still some parts of the code I'm not clear on, so making the larger change seems like a much bigger effort. It is something we definitely have to get back to at some point in the near future. As for this patch, I'm just considering it a bug fix and not a major refactor. http://gerrit.cloudera.org:8080/#/c/9635/4//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9635/4//COMMIT_MSG@53 PS4, Line 53: Testing: Ran the stress test with the new patch a few times to make sure > You could add a pause to one of the runners and kill it to test this manual I saw that query runners were dying in the middle (some of them keep dying for some reason and more spin up after that, I don't know why yet). I also saw that the same errors that used to cause the hang before did not cause the hang anymore. http://gerrit.cloudera.org:8080/#/c/9635/4/tests/stress/concurrent_select.py File tests/stress/concurrent_select.py: http://gerrit.cloudera.org:8080/#/c/9635/4/tests/stress/concurrent_select.py@327 PS4, Line 327: # All values below are cumulative. > maybe add a comment to describe how these are populated? Done http://gerrit.cloudera.org:8080/#/c/9635/4/tests/stress/concurrent_select.py@373 PS4, Line 373: for runner in self._query_runners: > You could just call _total_num_queries_started() and _total_num_queries_fin I'd rather we not add a RLock at this point, since a lot of this script may still be pretty buggy and adding a Recursive Lock would make it much harder for us to find them. http://gerrit.cloudera.org:8080/#/c/9635/4/tests/stress/concurrent_select.py@429 PS4, Line 429: with self._query_runners_lock: > I don't think holding the lock is necessary here. Python's GIL makes sure t Done http://gerrit.cloudera.org:8080/#/c/9635/4/tests/stress/concurrent_select.py@528 PS4, Line 528: while self._total_num_queries_started() < self._num_queries_to_run: > It is not clear to me whether a query runner will run a single query and th The relationship of query runners to queries is not clear and undocumented. I did not make an attempt to refactor that as it would have ended up being a much larger change. However, as I understand it, a query runner can run multiple queries (no clear limit on how many). But a lot of them fail randomly (don't know why yet), and as they keep failing, more keep getting spun up. This is something I want to pick up at some point later on when I or someone else has more time. http://gerrit.cloudera.org:8080/#/c/9635/4/tests/stress/concurrent_select.py@548 PS4, Line 548: Process(target=self._start_single_runner, args=(query_runner, )) > This looks hairy to me, can you add a comment explaining how exactly sharin All the values that are shared between processes are between L973-L980. The rest are only process local variables. There are still some in the StressRunner() class that are shared between processes, but they are involved in more complicated parts of the code, so I haven't touched them in this patch. -- To view, visit http://gerrit.cloudera.org:8080/9635 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I10c5dc9b8c2fffc471bac2279e348bc1d1fec3b7 Gerrit-Change-Number: 9635 Gerrit-PatchSet: 4 Gerrit-Owner: Sailesh Mukil <[email protected]> Gerrit-Reviewer: David Knupp <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Michael Brown <[email protected]> Gerrit-Reviewer: Sailesh Mukil <[email protected]> Gerrit-Comment-Date: Thu, 15 Mar 2018 21:54:33 +0000 Gerrit-HasComments: Yes
