[
https://issues.apache.org/jira/browse/FLINK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317657#comment-17317657
]
Xintong Song commented on FLINK-20663:
--------------------------------------
I'm not sure about this being a planner bug.
My gut feeling is that this is some runtime/operator bug that happens to be
triggered by a certain task execution pattern. Changing the SQL query also
changes the compiled execution plan, so the bug is not triggered with the new
plan.
It seems we cannot easily reproduce your problem. We cannot execute the scripts
locally because the "chloe" packages are missing, and even if we had the
packages, I assume we could not access the data sources anyway.
Would it be possible for you to run the scripts with a debug version of Flink?
I can provide you with a GitHub branch based on 1.12.1, with some additional
logs that help us understand what's going on with the memory. You would need to
check out the branch and manually build and install the package. Would that
work for you?
> Managed memory may not be released in time when operators use managed memory frequently
> ---------------------------------------------------------------------------------------
>
> Key: FLINK-20663
> URL: https://issues.apache.org/jira/browse/FLINK-20663
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Task
> Affects Versions: 1.12.0
> Reporter: Caizhi Weng
> Assignee: Xintong Song
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.12.2, 1.13.0
>
> Attachments: exception, summary.py, taskmanager.log
>
>
> Some batch operators (like sort merge join or hash aggregate) use managed
> memory frequently. When these operators are chained together and the cluster
> load is a bit heavy, it is very likely that the following exception occurs:
> {code:java}
> 2020-12-18 10:04:32
> java.lang.RuntimeException: org.apache.flink.runtime.memory.MemoryAllocationException: Could not allocate 512 pages
> 	at org.apache.flink.table.runtime.util.LazyMemorySegmentPool.nextSegment(LazyMemorySegmentPool.java:85)
> 	at org.apache.flink.runtime.io.disk.SimpleCollectingOutputView.<init>(SimpleCollectingOutputView.java:49)
> 	at org.apache.flink.table.runtime.operators.aggregate.BytesHashMap$RecordArea.<init>(BytesHashMap.java:297)
> 	at org.apache.flink.table.runtime.operators.aggregate.BytesHashMap.<init>(BytesHashMap.java:103)
> 	at org.apache.flink.table.runtime.operators.aggregate.BytesHashMap.<init>(BytesHashMap.java:90)
> 	at LocalHashAggregateWithKeys$209161.open(Unknown Source)
> 	at org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:401)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$2(StreamTask.java:506)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:92)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:501)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:530)
> 	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:722)
> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:547)
> 	at java.lang.Thread.run(Thread.java:834)
> 	Suppressed: java.lang.NullPointerException
> 		at LocalHashAggregateWithKeys$209161.close(Unknown Source)
> 		at org.apache.flink.table.runtime.operators.TableStreamOperator.dispose(TableStreamOperator.java:46)
> 		at org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:739)
> 		at org.apache.flink.streaming.runtime.tasks.StreamTask.runAndSuppressThrowable(StreamTask.java:719)
> 		at org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUpInvoke(StreamTask.java:642)
> 		at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:551)
> 		... 3 more
> 	Suppressed: java.lang.NullPointerException
> 		at LocalHashAggregateWithKeys$209766.close(Unknown Source)
> 		... 8 more
> Caused by: org.apache.flink.runtime.memory.MemoryAllocationException: Could not allocate 512 pages
> 	at org.apache.flink.runtime.memory.MemoryManager.allocatePages(MemoryManager.java:231)
> 	at org.apache.flink.table.runtime.util.LazyMemorySegmentPool.nextSegment(LazyMemorySegmentPool.java:83)
> 	... 13 more
> Caused by: org.apache.flink.runtime.memory.MemoryReservationException: Could not allocate 16777216 bytes, only 9961487 bytes are remaining. This usually indicates that you are requesting more memory than you have reserved. However, when running an old JVM version it can also be caused by slow garbage collection. Try to upgrade to Java 8u72 or higher if running on an old Java version.
> 	at org.apache.flink.runtime.memory.UnsafeMemoryBudget.reserveMemory(UnsafeMemoryBudget.java:164)
> 	at org.apache.flink.runtime.memory.UnsafeMemoryBudget.reserveMemory(UnsafeMemoryBudget.java:80)
> 	at org.apache.flink.runtime.memory.MemoryManager.allocatePages(MemoryManager.java:229)
> 	... 14 more
> {code}
> It seems that this is caused by relying on GC to release managed memory, as
> {{System.gc()}} may not trigger GC in time. See {{UnsafeMemoryBudget.java}}.
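The failure mode described above can be sketched roughly as follows. This is a
simplified, hypothetical illustration, not Flink's actual
{{UnsafeMemoryBudget}} code (the class and method names below are invented): a
reservation that comes up short retries after hinting the GC, and still fails
if memory that will only be returned by a pending garbage collection has not
become available in time.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of a GC-backed memory budget; NOT Flink's code. */
public class GcBackedBudget {
    private final AtomicLong available;

    public GcBackedBudget(long totalBytes) {
        this.available = new AtomicLong(totalBytes);
    }

    /**
     * Tries to reserve {@code bytes}. If not enough memory is available,
     * hints the GC up to {@code maxGcAttempts} times and retries, because
     * memory whose owners have already dropped their references is only
     * returned to the budget once a GC cycle has actually run.
     */
    public boolean tryReserve(long bytes, int maxGcAttempts) {
        for (int attempt = 0; ; attempt++) {
            long current = available.get();
            if (current >= bytes && available.compareAndSet(current, current - bytes)) {
                return true; // reservation succeeded
            }
            if (attempt >= maxGcAttempts) {
                // This is the point where a "Could not allocate ... bytes"
                // style reservation failure would surface.
                return false;
            }
            // System.gc() is only a hint; nothing guarantees the pending
            // garbage is collected before the next retry.
            System.gc();
        }
    }

    public void release(long bytes) {
        available.addAndGet(bytes);
    }
}
```

Under heavy load, chained operators that reserve and release frequently can
race this retry loop, which is consistent with the description's conclusion
that relying on GC to release managed memory is the root cause.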
--
This message was sent by Atlassian Jira
(v8.3.4#803005)