Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/15666 )
Change subject: IMPALA-5746: Test case for remote fragments releasing memory
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15666/1/tests/custom_cluster/test_restart_services.py
File tests/custom_cluster/test_restart_services.py:

http://gerrit.cloudera.org:8080/#/c/15666/1/tests/custom_cluster/test_restart_services.py@240
PS1, Line 240: status_report_max_retry_s
> I can add a separate test to validate this.

Great.

> On a separate note, I'm not sure why status_report_max_retry_s
> needs to be 10 minutes?

The thinking when we did this was that if the fragment is done executing, all we're doing is keeping a single thread alive that periodically retries sending the status report (we use exponential backoff based on the number of failed reports, so it wouldn't be very frequent). That's pretty cheap, and better than having the backend cancel itself if the coordinator is actually still alive.

You make a good point that we could have the fragments get cancelled if the coordinator is removed from the cluster membership.

> Interested in learning more about the case you mentioned though. I
> think one thing I currently don't understand is what happens when a
> primary coordinator fails and backend take-over. I seem to recall
> that coordinators try to "derive" the state of the admission
> control from the state-store? or from the state of all the
> backends?

Right, a new coordinator will get the info about cluster load from the statestore. However, it won't get any info about the queries that the failed coordinator had been running, since IMPALA_REQUEST_QUEUE_TOPIC is transient, so all of the updates from the failed coordinator will be removed when it disconnects from the statestore.
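To illustrate the "exponential backoff based on the number of failed reports" point, here is a toy Python sketch (not Impala's actual C++ implementation; the function name, base interval, and cap are made up for illustration) of why a stranded reporting thread stays cheap:

```python
def report_interval_s(failed_reports, base_interval_s=5.0, max_interval_s=600.0):
    # Hypothetical sketch: double the wait after each consecutive failed
    # status report, capped at a status_report_max_retry_s-style ceiling.
    # The constants here are illustrative, not Impala's real defaults.
    return min(base_interval_s * (2 ** failed_reports), max_interval_s)

# The interval grows quickly, so after a handful of failures the thread
# only wakes up every few minutes.
intervals = [report_interval_s(n) for n in range(8)]
print(intervals)
```

With this shape of backoff, the per-fragment cost of waiting out a slow or dead coordinator is one mostly-idle thread, which is the trade-off described above.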
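To make the transient-topic behavior concrete, here is a toy Python model (not Impala's statestore implementation; the class and method names are invented) of why the failed coordinator's entries vanish and a new coordinator sees an empty topic:

```python
class ToyStatestore:
    # Toy model of a statestore with one transient topic: entries a
    # subscriber published are dropped when that subscriber disconnects.
    def __init__(self):
        self.topics = {"IMPALA_REQUEST_QUEUE_TOPIC": {}}

    def update(self, topic, subscriber, key, value):
        self.topics[topic][(subscriber, key)] = value

    def disconnect(self, subscriber):
        # Transient semantics: purge every entry this subscriber published.
        for entries in self.topics.values():
            for k in [k for k in entries if k[0] == subscriber]:
                del entries[k]

ss = ToyStatestore()
ss.update("IMPALA_REQUEST_QUEUE_TOPIC", "coord1", "query_1", "running")
ss.disconnect("coord1")
# After the disconnect, the topic is empty: a new coordinator gets
# cluster-load info from the statestore but nothing about coord1's queries.
print(ss.topics["IMPALA_REQUEST_QUEUE_TOPIC"])
```

This is the gap the rest of the comment worries about: the queries are still running on the executors, but their admission-control state is gone.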
I'm not certain what the autoscaler uses to determine cluster load, but I think it's based on the executors' debug webui, so it would still see the queries from the failed coordinator until the fragments get cancelled. If so, there's a potentially bad interaction there: the new coordinator thinks there's no load on the cluster and goes ahead and schedules a bunch of new queries, while the autoscaler sees a much higher load and gets triggered.

--
To view, visit http://gerrit.cloudera.org:8080/15666
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If9fe8309f80f797d205b756ba58219f595aba4e5
Gerrit-Change-Number: 15666
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Comment-Date: Thu, 09 Apr 2020 19:03:50 +0000
Gerrit-HasComments: Yes
