[ 
https://issues.apache.org/jira/browse/FLINK-10367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16627115#comment-16627115
 ] 

Nico Kruber commented on FLINK-10367:
-------------------------------------

After taking a brief look at the code, I'd also prefer option 2. However, it 
may lead to spurious exceptions during release (I'm not 100% sure here, please 
check): part of releasing a channel is closing the partition request client or 
removing it from the connection manager. Until that is done, I guess other 
parts of the network stack _could_ (really?) still receive data and rely on 
the buffer pool being available.
With a graceful shutdown after a task completes, this shouldn't happen. After 
an error and the subsequent channel release, however, we don't want an error 
from the release to suppress the original one.

If that really is a possibility, there is also a third option (I'm not saying 
this is the one to go with):

3. Let {{notifyBufferAvailable}} return whether it has actually taken over the 
buffer. Its implementations would then no longer recycle the buffer 
themselves; instead, {{LocalBufferPool#recycle}} would try to re-distribute 
the {{Buffer}} to other listeners.
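
A rough sketch of what option 3 could look like. It uses simplified stand-ins, 
not Flink's actual classes or signatures: the {{boolean}} return value, 
{{Pool}}, and {{Channel}} are assumptions for illustration only.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Simplified stand-ins for illustration; not Flink's real classes.
final class RecycleSketch {

    /** Hypothetical listener: returns true only if it kept the buffer. */
    interface BufferListener {
        boolean notifyBufferAvailable(Object buffer);
    }

    /** Hypothetical pool, loosely modelled on LocalBufferPool. */
    static final class Pool {
        private final Queue<BufferListener> listeners = new ArrayDeque<>();
        private final Queue<Object> available = new ArrayDeque<>();

        void addListener(BufferListener listener) {
            listeners.add(listener);
        }

        int availableBuffers() {
            return available.size();
        }

        /** Offers the buffer to listeners in a loop, so the stack depth stays constant. */
        void recycle(Object buffer) {
            BufferListener listener;
            while ((listener = listeners.poll()) != null) {
                if (listener.notifyBufferAvailable(buffer)) {
                    return; // a listener took over the buffer
                }
                // Listener declined (e.g. already released): try the next one.
            }
            available.add(buffer); // nobody wanted it: keep it in the pool
        }
    }

    /** Hypothetical channel that declines buffers once it has been released. */
    static final class Channel implements BufferListener {
        private boolean released;
        private int received;

        void release() {
            released = true;
        }

        int received() {
            return received;
        }

        @Override
        public boolean notifyBufferAvailable(Object buffer) {
            if (released) {
                return false; // do not recycle here; the pool re-distributes
            }
            received++;
            return true;
        }
    }
}
```

Because a released listener only reports "not taken" instead of calling back 
into the pool, the re-distribution is a loop in one place rather than a chain 
of nested calls.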

> Avoid recursion stack overflow during releasing SingleInputGate
> ---------------------------------------------------------------
>
>                 Key: FLINK-10367
>                 URL: https://issues.apache.org/jira/browse/FLINK-10367
>             Project: Flink
>          Issue Type: Improvement
>          Components: Network
>    Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.5.3, 1.6.0
>            Reporter: zhijiang
>            Assignee: zhijiang
>            Priority: Minor
>
> When a task fails or is canceled, {{SingleInputGate#releaseAllResources}} 
> is invoked before the task exits.
> {{SingleInputGate#releaseAllResources}} first loops over all input channels 
> to release them, then destroys the {{BufferPool}}. 
> {{RemoteInputChannel#releaseAllResources}} returns floating buffers to the 
> {{BufferPool}}, which assigns each recycled buffer to another listener 
> ({{RemoteInputChannel}}).
> This process can become recursive. If the listener has already been 
> released, it recycles the buffer directly back to the {{BufferPool}}, which 
> takes another listener to notify about the available buffer. These steps may 
> repeat recursively.
> If many input channels are registered as listeners in the {{BufferPool}}, 
> the recursion causes a {{StackOverflowError}}. One of our test jobs with 
> 10,000 input channels hit this error.
> I see two ways to solve this potential problem:
>  # When an input channel is released, it notifies the {{BufferPool}} to 
> unregister it as a listener; otherwise the two are inconsistent.
>  # {{SingleInputGate}} destroys the {{BufferPool}} first, then loops to 
> release all internal input channels. Destroying the pool removes all its 
> listeners, so the input channels have no further interaction with it during 
> {{RemoteInputChannel#releaseAllResources}}.
> I prefer the second way, because we do not want to add another interface 
> method for removing a buffer listener; furthermore, the current internal data 
> structure in {{BufferPool}} cannot remove a listener directly.
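
For reference, the recursion described in the issue and the effect of the 
second way (destroying the pool first) can be modelled roughly as below. This 
is a toy model under stated assumptions, not Flink's code: {{Pool}}, 
{{Channel}}, {{releaseGate}}, and the depth counter are hypothetical names, 
and the setup assumes all but one channel are waiting for a floating buffer 
while the last one holds a buffer to return.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy model for illustration; not Flink's real classes.
final class ReleaseOrderSketch {

    static final class Pool {
        private final Queue<Channel> listeners = new ArrayDeque<>();
        private boolean destroyed;
        private int depth;
        private int maxDepth;

        void addListener(Channel channel) {
            listeners.add(channel);
        }

        /** Hands the buffer to the next listener; tracks nesting depth of recycle calls. */
        void recycle(Object buffer) {
            depth++;
            maxDepth = Math.max(maxDepth, depth);
            Channel next = destroyed ? null : listeners.poll();
            if (next != null) {
                next.notifyBufferAvailable(buffer);
            }
            depth--;
        }

        /** Destroying the pool drops all registered listeners. */
        void destroy() {
            destroyed = true;
            listeners.clear();
        }
    }

    static final class Channel {
        private final Pool pool;
        private final Object floatingBuffer; // null if the channel holds none
        private boolean released;

        Channel(Pool pool, Object floatingBuffer) {
            this.pool = pool;
            this.floatingBuffer = floatingBuffer;
        }

        void notifyBufferAvailable(Object buffer) {
            if (released) {
                pool.recycle(buffer); // released listener hands it straight back: recursion
            }
            // A live channel would keep the buffer here.
        }

        void releaseAllResources() {
            released = true;
            if (floatingBuffer != null) {
                pool.recycle(floatingBuffer);
            }
        }
    }

    /** Releases a gate with the given number of channels and returns the deepest recycle nesting. */
    static int releaseGate(int channelCount, boolean destroyPoolFirst) {
        Pool pool = new Pool();
        List<Channel> channels = new ArrayList<>();
        for (int i = 0; i < channelCount - 1; i++) {
            Channel waiting = new Channel(pool, null);
            pool.addListener(waiting); // waiting for a floating buffer
            channels.add(waiting);
        }
        channels.add(new Channel(pool, new Object())); // holds one floating buffer
        if (destroyPoolFirst) {
            pool.destroy();
        }
        for (Channel channel : channels) {
            channel.releaseAllResources();
        }
        if (!destroyPoolFirst) {
            pool.destroy();
        }
        return pool.maxDepth;
    }

    public static void main(String[] args) {
        System.out.println(releaseGate(1000, false)); // channels first: depth grows with channel count
        System.out.println(releaseGate(1000, true));  // pool first: depth stays at 1
    }
}
```

In this model, releasing the channels first nests one {{recycle}} call per 
already-released listener, while destroying the pool first leaves no listener 
to call back into.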



