[
https://issues.apache.org/jira/browse/FLINK-10367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
zhijiang updated FLINK-10367:
-----------------------------
Description:
When a task fails or is canceled, {{SingleInputGate#releaseAllResources}} is
invoked before the task exits.
In {{SingleInputGate#releaseAllResources}}, we first loop over all the input
channels to release them, and then destroy the {{BufferPool}}.
{{RemoteInputChannel#releaseAllResources}} returns its floating buffers to the
{{BufferPool}}, which assigns each recycled buffer to another listener
({{RemoteInputChannel}}).
This process can recurse. If that listener has already been released, it
recycles the buffer directly back to the {{BufferPool}}, which then takes
another listener to notify of the available buffer. This can repeat
recursively, one listener after another.
If many input channels are registered as listeners in the {{BufferPool}}, the
recursion can cause a {{StackOverflowError}}. In our testing job, a scale of
10,000 input channels caused this error.
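To illustrate how the call chain grows, here is a minimal sketch of the
recycle/notify loop. It is a toy model, not Flink code: {{ToyPool}},
{{ToyChannel}} and {{maxDepthWhenAllReleased}} are hypothetical names, and the
model assumes that every registered listener has already been released.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Toy model of the recycle/notify chain (hypothetical, not Flink's classes). */
class RecycleChainDemo {

    /** Stand-in for a buffer pool that hands each recycled buffer to one listener. */
    static final class ToyPool {
        final Queue<ToyChannel> listeners = new ArrayDeque<>();
        int depth = 0;
        int maxDepth = 0;

        void recycle() {
            depth++;
            maxDepth = Math.max(maxDepth, depth);
            ToyChannel next = listeners.poll();
            if (next != null) {
                // Called from inside recycle(), so the stack keeps growing.
                next.notifyBufferAvailable(this);
            }
            depth--;
        }
    }

    /** Stand-in for an input channel registered as a buffer listener. */
    static final class ToyChannel {
        boolean released;

        void notifyBufferAvailable(ToyPool pool) {
            if (released) {
                // Already released: hand the buffer straight back to the pool.
                pool.recycle();
            }
            // A live channel would keep the buffer, which ends the chain.
        }
    }

    /** Returns the deepest nesting of recycle() for one returned buffer. */
    static int maxDepthWhenAllReleased(int channels) {
        ToyPool pool = new ToyPool();
        for (int i = 0; i < channels; i++) {
            ToyChannel channel = new ToyChannel();
            channel.released = true;
            pool.listeners.add(channel);
        }
        pool.recycle(); // a single floating buffer is returned
        return pool.maxDepth;
    }

    public static void main(String[] args) {
        System.out.println(maxDepthWhenAllReleased(100));
    }
}
```

In this model the nesting depth grows linearly with the number of released
listeners, which is why a large enough number of input channels can exhaust
the thread stack.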
I see two ways to solve this potential problem:
# When an input channel is released, it should notify the {{BufferPool}} to
unregister it as a listener; otherwise the two are left in an inconsistent
state.
# {{SingleInputGate}} should destroy the {{BufferPool}} first, then loop to
release all the internal input channels. This way, all the listeners in the
{{BufferPool}} are removed during destruction, and the input channels have no
further interactions with it during {{RemoteInputChannel#releaseAllResources}}.
I prefer the second way, because we do not want to add another interface method
for removing a buffer listener. In addition, the current internal data
structure in {{BufferPool}} cannot support removing a listener directly.
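The effect of the second option can be sketched with a toy model that compares
the two release orders. This is an illustration under simplifying assumptions,
not Flink's implementation: {{ToyPool}}, {{ToyChannel}} and {{release}} are
hypothetical names, each channel starts with one floating buffer, and a
destroyed pool simply drops recycled buffers.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Toy comparison of the two release orders (hypothetical, not Flink's classes). */
class ReleaseOrderDemo {

    static final class ToyPool {
        final Queue<ToyChannel> listeners = new ArrayDeque<>();
        boolean destroyed;
        int notifications;

        void recycle() {
            if (destroyed) {
                return; // assumption: a destroyed pool notifies nobody
            }
            ToyChannel next = listeners.poll();
            if (next != null) {
                notifications++;
                next.notifyBufferAvailable(this);
            }
        }

        void destroy() {
            destroyed = true;
            listeners.clear(); // all listeners are dropped at once
        }
    }

    static final class ToyChannel {
        boolean released;
        int floatingBuffers = 1;

        void notifyBufferAvailable(ToyPool pool) {
            if (released) {
                pool.recycle(); // released channel bounces the buffer back
            } else {
                floatingBuffers++; // live channel keeps the buffer
            }
        }

        void releaseAllResources(ToyPool pool) {
            released = true;
            while (floatingBuffers > 0) {
                floatingBuffers--;
                pool.recycle();
            }
        }
    }

    /** Releases all channels and returns how many listener notifications occurred. */
    static int release(int channels, boolean destroyPoolFirst) {
        ToyPool pool = new ToyPool();
        List<ToyChannel> all = new ArrayList<>();
        for (int i = 0; i < channels; i++) {
            ToyChannel channel = new ToyChannel();
            all.add(channel);
            pool.listeners.add(channel);
        }
        if (destroyPoolFirst) {
            pool.destroy();
        }
        for (ToyChannel channel : all) {
            channel.releaseAllResources(pool);
        }
        if (!destroyPoolFirst) {
            pool.destroy();
        }
        return pool.notifications;
    }

    public static void main(String[] args) {
        System.out.println("channels first: " + release(5, false));
        System.out.println("pool first:     " + release(5, true));
    }
}
```

With the pool destroyed first, no listener is notified while the channels are
released, so the recycle/notify chain cannot start in this model.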
> Avoid recursion stack overflow during releasing SingleInputGate
> ---------------------------------------------------------------
>
> Key: FLINK-10367
> URL: https://issues.apache.org/jira/browse/FLINK-10367
> Project: Flink
> Issue Type: Improvement
> Components: Network
> Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.5.3, 1.6.0
> Reporter: zhijiang
> Assignee: zhijiang
> Priority: Minor
>