[
https://issues.apache.org/jira/browse/FLINK-21419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301350#comment-17301350
]
Xintong Song commented on FLINK-21419:
--------------------------------------
[~trohrmann] [~kezhuw],
*Concurrent/multiple frees:*
I like the idea of guarding against concurrent frees by default and gradually
hunting down the problematic cases by failing in dev builds.
In that case, I'd suggest hunting for not only concurrent frees but also
multiple frees, since the latter also indicate unclear ownership.
Another concern is that failing for all multiple-free cases may
significantly disturb our build stability. I'd try to avoid doing that two
weeks before the feature freeze. IMO, hunting down the multiple-free issues
should not be a release blocker.
To sum up, my proposal is as follows:
# Always guard against concurrent frees.
# Make it configurable whether to fail on multiple frees.
## Enable it manually to detect the consistently failing cases and fix them.
## Once the consistently failing cases are fixed, enable it for CI to detect the unstable cases.
## Once we are confident that all multiple-free cases are fixed, enable it always.
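To make the proposal concrete, here is a minimal sketch of what I have in mind. {{GuardedSegment}} and the {{failOnMultipleFree}} flag are hypothetical names for illustration only, not the actual {{MemorySegment}} API. An atomic flag ensures the memory is released exactly once even under concurrent calls, and a second free either becomes a no-op or fails, depending on the flag.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a memory segment; not Flink's actual MemorySegment.
final class GuardedSegment {
    private final AtomicBoolean freed = new AtomicBoolean(false);
    // Counts how often the underlying memory was really released (for illustration).
    private final AtomicInteger releaseCount = new AtomicInteger();
    // Assumed switch: whether a second free() should fail or be silently ignored.
    private final boolean failOnMultipleFree;

    GuardedSegment(boolean failOnMultipleFree) {
        this.failOnMultipleFree = failOnMultipleFree;
    }

    void free() {
        // compareAndSet lets exactly one caller win, even if calls race.
        if (freed.compareAndSet(false, true)) {
            releaseCount.incrementAndGet(); // placeholder for releasing the memory
        } else if (failOnMultipleFree) {
            throw new IllegalStateException("segment freed more than once");
        }
    }

    boolean isFreed() {
        return freed.get();
    }

    int releaseCount() {
        return releaseCount.get();
    }
}
```

With the flag off, the guard is always active but harmless for existing callers. Turning it on (manually first, then in CI, then always) corresponds to steps 2.1 to 2.3 above.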
*Safety net for leaking:*
IIUC, the original purpose of using the GC cleaner was to rely on both explicit
segment free and GC for releasing the memory, while making sure the memory is
released only once. Given that we no longer rely on GC for releasing the
memory, I think {{finalize}} is good enough to serve as a leak detector.
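A rough sketch of the detector-only idea, assuming a hypothetical {{LeakCheckedSegment}} class and a static {{LEAKS}} counter (neither exists in Flink): {{finalize}} releases nothing and only reports a segment that was never explicitly freed. When the finalizer runs is up to the JVM, so this is a best-effort diagnostic.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical segment that uses finalize() only to detect leaks, not to free memory.
final class LeakCheckedSegment {
    // Assumed global counter of detected leaks; a real detector would likely log instead.
    static final AtomicInteger LEAKS = new AtomicInteger();

    private volatile boolean freed;

    void free() {
        freed = true; // placeholder for the explicit release of the memory
    }

    // Returns true and records a leak if the segment was never explicitly freed.
    boolean reportIfLeaked() {
        if (!freed) {
            LEAKS.incrementAndGet();
            return true;
        }
        return false;
    }

    @Override
    @SuppressWarnings({"deprecation", "removal"})
    protected void finalize() {
        // Detection only: the memory itself is not released here.
        reportIfLeaked();
    }
}
```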
As for {{MemoryManager#verifyEmpty}}, it is helpful for session cluster cases
where segments are not properly collected when the task finishes. E.g., a
thread that is not properly cleaned up may still hold references to the
segments, preventing the segments from being GC-ed. We've seen similar problems
cause metaspace leaks before.
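To illustrate why an explicit check catches what GC-based detection cannot, here is a simplified sketch. {{TrackingMemoryPool}} is a hypothetical class, not Flink's {{MemoryManager}}; it only mimics the idea of a {{verifyEmpty}}-style check. The pool tracks outstanding pages itself, so a page still referenced by a lingering thread shows up as not returned regardless of whether the GC could ever collect it.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical pool that tracks outstanding pages; not Flink's MemoryManager.
final class TrackingMemoryPool {
    private final int totalPages;
    // Pages handed out and not yet returned. Plain Objects use identity equality.
    private final Set<Object> outstanding = new HashSet<>();

    TrackingMemoryPool(int totalPages) {
        this.totalPages = totalPages;
    }

    synchronized Object allocate() {
        if (outstanding.size() >= totalPages) {
            throw new IllegalStateException("pool exhausted");
        }
        Object page = new Object(); // placeholder for a real memory segment
        outstanding.add(page);
        return page;
    }

    synchronized void release(Object page) {
        outstanding.remove(page);
    }

    // True only if every allocated page has been returned, e.g. checked when a task finishes.
    synchronized boolean verifyEmpty() {
        return outstanding.isEmpty();
    }
}
```

A check like this at task shutdown does not depend on reachability, which is what makes it useful in the session cluster case.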
> Remove GC cleaner mechanism for unsafe memory segments
> ------------------------------------------------------
>
> Key: FLINK-21419
> URL: https://issues.apache.org/jira/browse/FLINK-21419
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / Coordination
> Reporter: Xintong Song
> Assignee: Nicholas Jiang
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.13.0
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)