[ 
https://issues.apache.org/jira/browse/IMPALA-14605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18054267#comment-18054267
 ] 

ASF subversion and git services commented on IMPALA-14605:
----------------------------------------------------------

Commit c9bfdbb272238a73e95e483c823f6b54f022de0d in impala's branch 
refs/heads/master from Yida Wu
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=c9bfdbb27 ]

IMPALA-14682: Use centralized async cleanup for admission state cleanup

In IMPALA-14605, we added a mechanism to clean up the admission state
asynchronously. This patch refactors all admission state deletions
to use this centralized async method, making it easier to reason
about when admission state is removed and to detect cases where a
query’s admission state is not properly cleared.
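
To illustrate the idea of routing every deletion through one asynchronous
path, here is a minimal sketch. It is not the code from this patch: the
names AdmissionStateRegistry, ScheduleCleanup, NumQueries, and WaitForIdle
are hypothetical, and a single background thread draining a queue is only
one assumed way to implement "centralized async cleanup".

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for the per-query admission state.
struct AdmissionState {
  bool cancelled = false;
};

// Hypothetical registry: every removal goes through ScheduleCleanup(), and a
// single background thread is the only code that erases from the map.
class AdmissionStateRegistry {
 public:
  AdmissionStateRegistry() : worker_([this] { CleanupLoop(); }) {}

  ~AdmissionStateRegistry() {
    {
      std::lock_guard<std::mutex> l(lock_);
      shutdown_ = true;
    }
    cv_.notify_all();
    worker_.join();
  }

  void Add(const std::string& query_id) {
    std::lock_guard<std::mutex> l(lock_);
    bool inserted = states_.emplace(query_id, AdmissionState{}).second;
    assert(inserted && "query id registered twice");
    (void)inserted;  // Silences unused-variable warnings when NDEBUG is set.
  }

  // The single entry point for deleting admission state.
  void ScheduleCleanup(const std::string& query_id) {
    {
      std::lock_guard<std::mutex> l(lock_);
      pending_.push_back(query_id);
    }
    cv_.notify_one();
  }

  // Number of states still held, analogous in spirit to a num-queries metric.
  std::size_t NumQueries() const {
    std::lock_guard<std::mutex> l(lock_);
    return states_.size();
  }

  // Blocks until every scheduled cleanup has been applied (handy in tests).
  void WaitForIdle() {
    std::unique_lock<std::mutex> l(lock_);
    idle_cv_.wait(l, [this] { return pending_.empty(); });
  }

 private:
  void CleanupLoop() {
    std::unique_lock<std::mutex> l(lock_);
    while (true) {
      cv_.wait(l, [this] { return shutdown_ || !pending_.empty(); });
      if (pending_.empty()) return;  // Shutdown requested and queue drained.
      std::string id = std::move(pending_.front());
      pending_.pop_front();
      states_.erase(id);  // Erasing an unknown id is a harmless no-op.
      if (pending_.empty()) idle_cv_.notify_all();
    }
  }

  mutable std::mutex lock_;
  std::condition_variable cv_;
  std::condition_variable idle_cv_;
  std::unordered_map<std::string, AdmissionState> states_;
  std::deque<std::string> pending_;
  bool shutdown_ = false;
  std::thread worker_;  // Declared last so it starts after the other members.
};
```

With a single deletion path like this, a caller would only ever invoke
ScheduleCleanup(query_id), and a state that is never scheduled shows up as
a NumQueries() value that does not return to zero.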

Additionally, this refactoring is a necessary step for future
improvements, such as implementing time-based deletion.

Also updated test_admission_state_map_mem_leak to verify the number
of admission states via the new global metric
admission-control-service.num-queries, which is more stable than
checking the log.

Tests:
Passed core tests.
Passed exhaustive custom_cluster/test_admission_controller.py test.

Change-Id: I04f46f2e42ec5e50f4dcccb6b73a34a376615ab0
Reviewed-on: http://gerrit.cloudera.org:8080/23873
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Memory leak in global admissiond when dequeuing cancelled queries
> -----------------------------------------------------------------
>
>                 Key: IMPALA-14605
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14605
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 4.5.0
>            Reporter: Yida Wu
>            Assignee: Yida Wu
>            Priority: Major
>              Labels: admission-control
>
> We have identified a memory leak scenario in the global admissiond. The issue 
> occurs when a query waiting in the admission queue is cancelled due to 
> backpressure failures but is not properly removed from the admission state 
> map during the dequeue process.
> Sequence of Events:
> A GetQueryStatus() call from the coordinator fails due to backpressure in admissiond.
> {code:java}
> I20251203 05:01:47.795506 3938873 status.cc:129] 
> c0476ba9e0acf5c3:012f334b00000000] GetQueryStatus rpc failed: Remote error: 
> Service unavailable: GetQueryStatus request on impala.AdmissionControlService 
> from 127.0.0.6:43351 dropped due to backpressure. The service queue contains 
> 5 items out of a maximum of 2147483647; memory consumption is 68.54 MB.
> {code}
> Consequently, the coordinator sends a cancel request for the queued query. 
> The CancelAdmission function sets the cancel flag in the admission state, 
> code ref: 
> https://github.com/apache/impala/blob/master/be/src/scheduling/admission-control-service.cc#L282-L289
> {code:java}
> I20251203 05:11:47.975906  104 admission-control-service.cc:284] 
> CancelAdmission: query_id=c0476ba9e0acf5c3:012f334b00000000
> {code}
> The admissiond tries to dequeue the query. It correctly identifies that the 
> query has been cancelled.
> {code:java}
> I20251203 05:11:48.116552  117 admission-controller.cc:2650] Dequeued 
> cancelled query=c0476ba9e0acf5c3:012f334b00000000
> {code}
> The memory leak is located in this dequeue logic. Although the admissiond 
> recognizes that the query is cancelled, it does not remove the query's entry 
> from the admission state map before moving on, so the entry is never freed.
> https://github.com/apache/impala/blob/master/be/src/scheduling/admission-controller.cc#L2655-L2658
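
The leak pattern described in the issue can be sketched as follows. This is
a simplified illustration, not the code in admission-controller.cc: the
names QueuedQuery, StateMap, and DrainQueue are hypothetical, and the
release_cancelled flag exists only to contrast the leaking behaviour with
the fixed one.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the state kept for each queued query.
struct QueuedQuery {
  bool cancelled = false;
};

using StateMap = std::unordered_map<std::string, QueuedQuery>;

// Pops every query id from `queue` and returns the ids that were admitted.
// Cancelled queries are skipped. When `release_cancelled` is false their
// entries stay in `states` forever, which models the leak; when it is true
// the entry is erased as part of the dequeue.
std::vector<std::string> DrainQueue(std::deque<std::string>& queue,
                                    StateMap& states,
                                    bool release_cancelled) {
  std::vector<std::string> admitted;
  while (!queue.empty()) {
    std::string id = std::move(queue.front());
    queue.pop_front();
    auto it = states.find(id);
    if (it == states.end()) continue;  // Unknown id: nothing to do.
    if (it->second.cancelled) {
      // "Dequeued cancelled query": without this erase the entry leaks.
      if (release_cancelled) states.erase(it);
      continue;
    }
    admitted.push_back(id);  // Admitted queries keep their state for now.
  }
  assert(queue.empty());  // Post-condition: the queue was fully drained.
  return admitted;
}
```

Calling DrainQueue with release_cancelled set to false leaves the cancelled
query's entry in the map after the queue is empty, which matches the
symptom reported above; with true, only the admitted query's state remains.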



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
