[
https://issues.apache.org/jira/browse/FLINK-21552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17293422#comment-17293422
]
Xintong Song commented on FLINK-21552:
--------------------------------------
BTW, I think this issue only affects when the job tries to reuse a cached slot.
As long as the slot is freed, memory manager will be destroyed and the corrupt
states will be discarded.
Since it doesn't affect other jobs in the cluster, and for the current job the
{{initializer}} is failing anyway, I think we can downgrade it to major
priority.
> The managed memory was not released if exception was thrown in
> createPythonExecutionEnvironment
> -----------------------------------------------------------------------------------------------
>
> Key: FLINK-21552
> URL: https://issues.apache.org/jira/browse/FLINK-21552
> Project: Flink
> Issue Type: Bug
> Components: API / Python
> Affects Versions: 1.12.0
> Reporter: Dian Fu
> Assignee: Dian Fu
> Priority: Critical
> Fix For: 1.13.0, 1.12.3
>
>
> If there is exception thrown in createPythonExecutionEnvironment, the job
> will failed with the following exception:
> {code}
> org.apache.flink.runtime.memory.MemoryAllocationException: Could not created
> the shared memory resource of size 611948962. Not enough memory left to
> reserve from the slot's managed memory.
> at
> org.apache.flink.runtime.memory.MemoryManager.lambda$getSharedMemoryResourceForManagedMemory$5(MemoryManager.java:536)
> at
> org.apache.flink.runtime.memory.SharedResources.createResource(SharedResources.java:126)
> at
> org.apache.flink.runtime.memory.SharedResources.getOrAllocateSharedResource(SharedResources.java:72)
> at
> org.apache.flink.runtime.memory.MemoryManager.getSharedMemoryResourceForManagedMemory(MemoryManager.java:555)
> at
> org.apache.flink.streaming.api.runners.python.beam.BeamPythonFunctionRunner.open(BeamPythonFunctionRunner.java:250)
> at
> org.apache.flink.streaming.api.operators.python.AbstractPythonFunctionOperator.open(AbstractPythonFunctionOperator.java:113)
> at
> org.apache.flink.table.runtime.operators.python.AbstractStatelessFunctionOperator.open(AbstractStatelessFunctionOperator.java:116)
> at
> org.apache.flink.table.runtime.operators.python.scalar.AbstractPythonScalarFunctionOperator.open(AbstractPythonScalarFunctionOperator.java:88)
> at
> org.apache.flink.table.runtime.operators.python.scalar.AbstractRowDataPythonScalarFunctionOperator.open(AbstractRowDataPythonScalarFunctionOperator.java:70)
> at
> org.apache.flink.table.runtime.operators.python.scalar.RowDataPythonScalarFunctionOperator.open(RowDataPythonScalarFunctionOperator.java:59)
> at
> org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:428)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$2(StreamTask.java:543)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:93)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:533)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:573)
> at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:755)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:570)
> at java.lang.Thread.run(Thread.java:834)
> Caused by: org.apache.flink.runtime.memory.MemoryReservationException: Could
> not allocate 611948962 bytes, only 0 bytes are remaining. This usually
> indicates that you are requesting more memory than you have reserved.
> However, when running an old JVM version it can also be caused by slow
> garbage collection. Try to upgrade to Java 8u72 or higher if running on an
> old Java version.
> at
> org.apache.flink.runtime.memory.UnsafeMemoryBudget.reserveMemory(UnsafeMemoryBudget.java:170)
> at
> org.apache.flink.runtime.memory.UnsafeMemoryBudget.reserveMemory(UnsafeMemoryBudget.java:84)
> at
> org.apache.flink.runtime.memory.MemoryManager.reserveMemory(MemoryManager.java:423)
> at
> org.apache.flink.runtime.memory.MemoryManager.lambda$getSharedMemoryResourceForManagedMemory$5(MemoryManager.java:534)
> ... 17 more
> {code}
> The reason is that the reserved managed memory was not added back to the
> MemoryManager when Job failed because of exceptions thrown in
> createPythonExecutionEnvironment. This causes that there is no managed memory
> to allocate during failover.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)