Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/15666 )
Change subject: IMPALA-5746: Test case for remote fragments releasing memory
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15666/1/tests/custom_cluster/test_restart_services.py
File tests/custom_cluster/test_restart_services.py:

http://gerrit.cloudera.org:8080/#/c/15666/1/tests/custom_cluster/test_restart_services.py@240
PS1, Line 240: status_report_max_retry_s
> I can add a separate test to validate this.

Great.

> On a separate note, I'm not sure why status_report_max_retry_s
> needs to be 10 minutes?

The thinking when we did this was that if the fragment is done executing, all we're doing is keeping a single thread alive that periodically retries sending the status report (we use exponential backoff based on the number of failed reports, so it wouldn't be very frequent). That's pretty cheap, and better than having the backend cancel itself if the coordinator is actually still alive.

You make a good point that we could have the fragments get cancelled if the coordinator is removed from the cluster membership.

> Interested in learning more about the case you mentioned though. I
> think one thing I currently don't understand is what happens when a
> primary coordinator fails and backend take-over. I seem to recall
> that coordinators try to "derive" the state of the admission
> control from the state-store? or from the state of all the
> backends?

Right, a new coordinator will get the info about cluster load from the statestore. However, it won't get any info about the queries that the failed coordinator had been running, since IMPALA_REQUEST_QUEUE_TOPIC is transient, so all of the updates from the failed coordinator will be removed when it disconnects from the statestore.
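To illustrate the "exponential backoff based on the number of failed reports" point, here is a toy Python sketch (not Impala's actual C++ implementation; the function name, base interval, and cap are made up for illustration) of why a stranded reporting thread stays cheap:

```python
def report_interval_s(failed_reports, base_interval_s=5.0, max_interval_s=600.0):
    # Hypothetical sketch: double the wait after each consecutive failed
    # status report, capped at a status_report_max_retry_s-style ceiling.
    # The constants here are illustrative, not Impala's real defaults.
    return min(base_interval_s * (2 ** failed_reports), max_interval_s)

# The interval grows quickly, so after a handful of failures the thread
# only wakes up every few minutes.
intervals = [report_interval_s(n) for n in range(8)]
print(intervals)
```

With this shape of backoff, the per-fragment cost of waiting out a slow or dead coordinator is one mostly-idle thread, which is the trade-off described above.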
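To make the transient-topic behavior concrete, here is a toy Python model (not Impala's statestore implementation; the class and method names are invented) of why the failed coordinator's entries vanish and a new coordinator sees an empty topic:

```python
class ToyStatestore:
    # Toy model of a statestore with one transient topic: entries a
    # subscriber published are dropped when that subscriber disconnects.
    def __init__(self):
        self.topics = {"IMPALA_REQUEST_QUEUE_TOPIC": {}}

    def update(self, topic, subscriber, key, value):
        self.topics[topic][(subscriber, key)] = value

    def disconnect(self, subscriber):
        # Transient semantics: purge every entry this subscriber published.
        for entries in self.topics.values():
            for k in [k for k in entries if k[0] == subscriber]:
                del entries[k]

ss = ToyStatestore()
ss.update("IMPALA_REQUEST_QUEUE_TOPIC", "coord1", "query_1", "running")
ss.disconnect("coord1")
# After the disconnect, the topic is empty: a new coordinator gets
# cluster-load info from the statestore but nothing about coord1's queries.
print(ss.topics["IMPALA_REQUEST_QUEUE_TOPIC"])
```

This is the gap the rest of the comment worries about: the queries are still running on the executors, but their admission-control state is gone.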
I'm not certain what the autoscaler uses to determine cluster load, but I think it's based on the executors' debug webui, so it would still see the queries from the failed coordinator until the fragments get cancelled. If so, there's a potentially bad interaction there: the new coordinator thinks there's no load on the cluster and goes ahead and schedules a bunch of new queries, while the autoscaler sees a much higher load and gets triggered.

--
To view, visit http://gerrit.cloudera.org:8080/15666
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If9fe8309f80f797d205b756ba58219f595aba4e5
Gerrit-Change-Number: 15666
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Comment-Date: Thu, 09 Apr 2020 19:03:50 +0000
Gerrit-HasComments: Yes
