[
https://issues.apache.org/jira/browse/FLINK-19852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17233895#comment-17233895
]
Roman Khachatryan edited comment on FLINK-19852 at 11/17/20, 7:51 PM:
----------------------------------------------------------------------
I took a look at the code and I think it can be solved by:
# "transferring" memory segments from an old TempBarrier to the new one in
BatchTask.resetAllInputs()
# For that, add a TempBarrier.closeForReuse() method that returns the
segments instead of calling memManager.release()
# In the TempBarrier constructor, some memory still has to be allocated because
some segments might have been returned to the reader/writer
I don't see a clean way to collect the old segments and create a new TB instance
atomically, because initInputLocalStrategy has to be called in between. Reusing
a TempBarrier instance instead seems to be error-prone.
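A rough sketch of the transfer idea above, under loud assumptions: Pool, Barrier and byte[] are hypothetical stand-ins for the memory manager, TempBarrier and a memory segment, and lendOut() only simulates segments that ended up with a reader/writer. None of this is actual Flink code.

```java
import java.util.ArrayList;
import java.util.List;

class SegmentTransferSketch {

    /** Hypothetical stand-in for the memory manager; counts allocations and releases. */
    static final class Pool {
        private int allocated;
        private int released;

        List<byte[]> allocate(int count) {
            List<byte[]> segments = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                segments.add(new byte[32 * 1024]);
            }
            allocated += count;
            return segments;
        }

        void release(List<byte[]> segments) {
            released += segments.size();
            segments.clear();
        }

        int allocated() { return allocated; }
        int released() { return released; }
    }

    /** Hypothetical stand-in for TempBarrier. */
    static final class Barrier {
        private final Pool pool;
        private final List<byte[]> segments;

        /** Takes over the reused segments and allocates only the shortfall. */
        Barrier(Pool pool, int required, List<byte[]> reused) {
            this.pool = pool;
            this.segments = new ArrayList<>(reused);
            int missing = required - segments.size();
            if (missing > 0) {
                segments.addAll(pool.allocate(missing));
            }
        }

        /** Simulates segments handed to a reader/writer, so they cannot be transferred. */
        void lendOut(int count) {
            for (int i = 0; i < count && !segments.isEmpty(); i++) {
                segments.remove(segments.size() - 1);
            }
        }

        /** Hands the remaining segments to the caller instead of releasing them. */
        List<byte[]> closeForReuse() {
            List<byte[]> handedOver = new ArrayList<>(segments);
            segments.clear();
            return handedOver;
        }

        /** Regular close: gives the segments back to the pool. */
        void close() {
            pool.release(segments);
        }

        int size() { return segments.size(); }
    }

    public static void main(String[] args) {
        Pool pool = new Pool();
        Barrier first = new Barrier(pool, 4, new ArrayList<>());
        first.lendOut(1); // one segment is no longer owned by the barrier

        // Roughly what resetAllInputs() would do between iterations.
        List<byte[]> carried = first.closeForReuse();
        Barrier second = new Barrier(pool, 4, carried);

        System.out.println("fresh allocations: " + pool.allocated()); // prints "fresh allocations: 5"
        second.close();
    }
}
```

In this sketch the second barrier only allocates the one missing segment; with a plain release in place of closeForReuse() it would have to allocate all four again.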
WDYT?
Besides that, I have some concerns regarding the issue:
# initInputLocalStrategy() might also allocate memory; are we sure that the
degradation is not caused by this? (e.g. ExternalSorterBuilder.doBuild())
# at least one thread is created for each TempBarrier; the same question
applies here
# How big is the regression, are there any numbers? Critical Priority seems a
bit subjective given that this issue first appeared in 1.10
[~shaomeng.wang], can you maybe clarify this?
was (Author: roman_khachatryan):
I took a look at the code and I think it can be solved by:
# "transferring" memory segments from an old TempBarrier to the new one in
BatchTask.resetAllInputs()
# For that, add TempBarrier.closeForReuse() method, from which return the
segments instead of calling memManager.release()
# In TempBarrier constructor, some memory still has to be allocated because
some segments might have been returned to reader/writer
I don't see a clean way to collect old segments and create a new TB instance
atomically. In between initInputLocalStrategy should be called. Reusing a
TempBarrier instance seems to be error-prone.
WDYT?
Besides that, I have some concerns regarding the issue:
# initInputLocalStrategy() might also allocate memory; Are we sure that
degradation is not caused by this? (e.g. ExternalSorterBuilder.doBuild())
# at least one thread is created - for each TempBarrier - the same question
# Are there any numbers? Critical Priority seems a bit subjective given that
this issue appeared first in 1.10
[~shaomeng.wang], can you maybe clarify?
> Managed memory released check can block IterativeTask
> -----------------------------------------------------
>
> Key: FLINK-19852
> URL: https://issues.apache.org/jira/browse/FLINK-19852
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Task
> Affects Versions: 1.11.0, 1.10.2, 1.12.0, 1.11.1, 1.11.2
> Reporter: shaomeng.wang
> Assignee: Roman Khachatryan
> Priority: Critical
> Attachments: image-2020-10-28-17-48-28-395.png,
> image-2020-10-28-17-48-48-583.png
>
>
> UnsafeMemoryBudget#reserveMemory, called on TempBarrier, has to wait for
> GC of all allocated/released managed memory at every iteration.
>
> stack:
> !image-2020-10-28-17-48-48-583.png!
> new TempBarrier in BatchTask
> !image-2020-10-28-17-48-28-395.png!
>
> This makes it much slower than before.