Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15666 )

Change subject: IMPALA-5746: Test case for remote fragments releasing memory
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15666/1/tests/custom_cluster/test_restart_services.py
File tests/custom_cluster/test_restart_services.py:

http://gerrit.cloudera.org:8080/#/c/15666/1/tests/custom_cluster/test_restart_services.py@240
PS1, Line 240: status_report_max_retry_s
> So its true that IMPALA-2990 fixes the issue that fragments will eventually
So in this specific test, the fragments do "release" the memory.
The /memz page on the executors shows that even though the fragments are still
"running", they have released all their memory.

For these experiments I used the default values of status_report_max_retry_s 
and status_report_interval_ms.

While the query is running and the coordinator is still up, the /memz endpoint 
of the executor shows:

    Query(de48705d080fa79b:43ee179800000000): Reservation=384.00 KB ReservationLimit=5.83 GB OtherMemory=3.09 MB Total=3.46 MB Peak=3.46 MB
      Fragment de48705d080fa79b:43ee179800000002: Reservation=384.00 KB OtherMemory=3.09 MB Total=3.46 MB Peak=3.46 MB
        HDFS_SCAN_NODE (id=0): Reservation=384.00 KB OtherMemory=3.08 MB Total=3.45 MB Peak=3.45 MB
          Exprs: Total=52.00 KB Peak=52.00 KB
        KrpcDataStreamSender (dst_id=1): Total=1.55 KB Peak=1.55 KB

When the coordinator gets killed, the fragments release their resources pretty
much immediately (well before the status_report_max_retry_s timeout is hit):

    Query(4f43a6f7b7745fa2:94f3032a00000000): Reservation=0 ReservationLimit=5.83 GB OtherMemory=0 Total=0 Peak=3.47 MB
      Fragment 4f43a6f7b7745fa2:94f3032a00000002: Reservation=0 OtherMemory=0 Total=0 Peak=3.47 MB
        HDFS_SCAN_NODE (id=0): Reservation=0 OtherMemory=0 Total=0 Peak=3.46 MB
        KrpcDataStreamSender (dst_id=1): Total=0 Peak=1.55 KB

I can add a separate test to validate this.
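
For what it's worth, here is a rough sketch of the kind of assertion such a test could make, assuming the raw /memz text keeps the line-per-node format quoted above (the helper name and the parsing are hypothetical, not existing test code):

```python
import re


def query_released_memory(memz_text, query_id):
    """Return True if every node in the given query's /memz subtree
    reports Total=0, i.e. the fragments have released their memory.

    Assumes the /memz layout quoted above: a "Query(<id>): ..." line
    followed by indented Fragment/exec-node lines, each carrying a
    "Total=<n> <unit>" figure.
    """
    in_query = False
    for line in memz_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Query("):
            # A new query subtree starts; track only the one we care about.
            in_query = query_id in stripped
        if in_query:
            m = re.search(r"Total=([\d.]+)", stripped)
            if m and float(m.group(1)) != 0:
                return False
    return True
```

A test would then fetch the executor's /memz page after killing the coordinator and poll until query_released_memory() returns True for the query id, failing if it doesn't within some timeout.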

On a separate note, I'm not sure why status_report_max_retry_s needs to be 10
minutes. If the coordinator dies, is there much benefit to trying to send the
report for an additional 10 minutes? I guess it may be hard to distinguish a
coordinator that has died from a network that is flaky. Maybe if an executor
receives a cluster membership update saying the coordinator has died, the
fragment should just be cancelled immediately?

Interested in learning more about the case you mentioned though. One thing I
currently don't understand is what happens when a primary coordinator fails and
a backup takes over. I seem to recall that coordinators try to "derive" the
admission control state from the statestore, or from the state of all the
backends?



--
To view, visit http://gerrit.cloudera.org:8080/15666
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If9fe8309f80f797d205b756ba58219f595aba4e5
Gerrit-Change-Number: 15666
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Comment-Date: Thu, 09 Apr 2020 00:06:43 +0000
Gerrit-HasComments: Yes
