[ 
https://issues.apache.org/jira/browse/FLINK-21419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301350#comment-17301350
 ] 

Xintong Song commented on FLINK-21419:
--------------------------------------

[~trohrmann] [~kezhuw],

*Concurrent/multiple frees:*
 I like the idea that we guard against concurrent frees by default, and hunt 
down the problematic cases gradually by failing in dev builds.

In that case, I'd suggest hunting for not only concurrent frees but also 
multiple frees, since the latter also indicate unclear ownership.

Another concern is that failing for all the multiple-free cases may 
significantly disturb our build stability. I'd try to avoid doing that two 
weeks before the feature freeze. IMO, hunting down the multiple-free issues 
should not be a release blocker.

To sum up, my proposal is as follows:
 # Always guard against concurrent frees.
 # Make it configurable whether to fail for multiple frees.
 ## Manually enable it to detect the consistently failing cases and fix them.
 ## Once the consistently failing cases are fixed, enable it for CI to detect 
unstable cases.
 ## Once we are confident that all multiple-free cases are fixed, enable it 
always.
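
To make the proposal concrete, here is a minimal sketch of what I have in mind. It is not the actual {{MemorySegment}} code: the class {{GuardedSegment}}, the {{failOnMultipleFree}} switch and the release counter are hypothetical names used only for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a memory segment; not Flink's MemorySegment.
class GuardedSegment {
    // Hypothetical switch; it could be wired to a config option or system property.
    private final boolean failOnMultipleFree;
    private final AtomicBoolean freed = new AtomicBoolean(false);
    // Counts how often the underlying memory was actually released.
    private final AtomicInteger releaseCount = new AtomicInteger();

    GuardedSegment(boolean failOnMultipleFree) {
        this.failOnMultipleFree = failOnMultipleFree;
    }

    void free() {
        // The compare-and-set lets exactly one caller win, so the release
        // happens once even if several threads call free() concurrently.
        if (freed.compareAndSet(false, true)) {
            releaseCount.incrementAndGet(); // placeholder for the real release
            return;
        }
        // A second free indicates unclear ownership; fail only when enabled.
        if (failOnMultipleFree) {
            throw new IllegalStateException("Segment has already been freed.");
        }
    }

    boolean isFreed() {
        return freed.get();
    }

    int releaseCount() {
        return releaseCount.get();
    }
}
```

With the switch off, a second {{free()}} is silently ignored. With it on, the second call throws, which is how the multiple-free cases would surface in tests.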

*Safety net for leaking:*
 IIUC, the original purpose of using the GC cleaner was to rely on both explicit 
segment free and GC for releasing the memory, while making sure the memory is 
released only once. Given that we are no longer relying on GC for releasing the 
memory, I think {{finalize}} is good enough to serve as a leak detector.
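
A rough sketch of such a {{finalize}}-based detector, under the assumption that it only reports and never releases memory. {{LeakCheckedSegment}}, {{reportIfLeaked}} and the {{LEAKS}} counter are hypothetical names, not existing Flink code. Note that {{finalize}} is deprecated in recent Java versions, and that the JVM gives no guarantee about when (or whether) it runs.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical segment that uses finalize() purely as a leak detector.
class LeakCheckedSegment {
    // Hypothetical global counter; a real detector might log or fail a test.
    static final AtomicInteger LEAKS = new AtomicInteger();

    private final AtomicBoolean freed = new AtomicBoolean(false);

    void free() {
        freed.set(true);
    }

    // Records a leak if the segment was never explicitly freed.
    // Returns true when a leak was recorded.
    boolean reportIfLeaked() {
        if (!freed.get()) {
            LEAKS.incrementAndGet();
            return true;
        }
        return false;
    }

    // Runs (at some unspecified time) once the object becomes unreachable.
    // It only reports; releasing memory stays the job of free().
    @Override
    @SuppressWarnings({"deprecation", "removal"})
    protected void finalize() throws Throwable {
        try {
            reportIfLeaked();
        } finally {
            super.finalize();
        }
    }
}
```

The check lives in a separate method so that it can be exercised directly in tests without depending on GC timing.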

As for {{MemoryManager#verifyEmpty}}, it is helpful for session cluster cases, 
where segments are not properly collected when the task finishes. E.g., a 
thread that is not properly cleaned up may still hold references to the 
segments, preventing the segments from being GC-ed. We've seen similar problems 
that caused metaspace leaks before. 
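
The reason an accounting-style check helps here is that it does not depend on GC at all: a segment that is still referenced never gets finalized, but its bytes are still missing from the budget. A simplified sketch of the idea follows; {{SimpleMemoryBudget}} and its methods are hypothetical and much simpler than the real {{MemoryManager}}.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical byte budget that can verify everything was returned.
class SimpleMemoryBudget {
    private final long totalBytes;
    private final AtomicLong availableBytes;

    SimpleMemoryBudget(long totalBytes) {
        this.totalBytes = totalBytes;
        this.availableBytes = new AtomicLong(totalBytes);
    }

    // Tries to reserve the given number of bytes; returns false if the
    // request is negative or exceeds what is currently available.
    boolean reserve(long bytes) {
        while (true) {
            long current = availableBytes.get();
            if (bytes < 0 || current < bytes) {
                return false;
            }
            if (availableBytes.compareAndSet(current, current - bytes)) {
                return true;
            }
        }
    }

    // Returns previously reserved bytes to the budget.
    void release(long bytes) {
        availableBytes.addAndGet(bytes);
    }

    // True only if all reserved bytes have been returned, regardless of
    // whether any object still holds a reference to a segment.
    boolean verifyEmpty() {
        return availableBytes.get() == totalBytes;
    }
}
```

Calling {{verifyEmpty()}} after a task finishes would then flag memory that was never returned, even when a leftover thread keeps the segments reachable.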

> Remove GC cleaner mechanism for unsafe memory segments
> ------------------------------------------------------
>
>                 Key: FLINK-21419
>                 URL: https://issues.apache.org/jira/browse/FLINK-21419
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>            Reporter: Xintong Song
>            Assignee: Nicholas Jiang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)
