[ 
https://issues.apache.org/jira/browse/FLINK-10367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16627155#comment-16627155
 ] 

zhijiang commented on FLINK-10367:
----------------------------------

Thanks for your feedback, [~NicoK].

For option 2, if the buffer pool is destroyed first, an input channel that 
has not yet been released may still request floating buffers from the pool 
while the channels are being released, and the pool will respond with an 
{{IllegalStateException}}. In that case the input channel should ignore this 
particular exception in {{onSenderBacklog}} instead of propagating it up the 
handler stack, where it would surface as a spurious failure.
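
A minimal sketch of that handling, not the actual Flink code: 
{{SketchBufferPool}}, {{SketchInputChannel}}, {{isDestroyed}} and the two 
public fields are hypothetical names used only for illustration. It assumes 
the channel can ask the pool whether it has been destroyed, so that any other 
{{IllegalStateException}} is still rethrown.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for a buffer pool; not Flink's real implementation.
class SketchBufferPool {
    private final Deque<Object> floatingBuffers = new ArrayDeque<>();
    private boolean destroyed;

    SketchBufferPool(int numBuffers) {
        for (int i = 0; i < numBuffers; i++) {
            floatingBuffers.add(new Object());
        }
    }

    synchronized void destroy() {
        destroyed = true;
        floatingBuffers.clear();
    }

    synchronized boolean isDestroyed() {
        return destroyed;
    }

    // Returns a buffer, or null if none is available right now.
    synchronized Object requestBuffer() {
        if (destroyed) {
            throw new IllegalStateException("Buffer pool is destroyed.");
        }
        return floatingBuffers.poll();
    }
}

// Hypothetical stand-in for a remote input channel.
class SketchInputChannel {
    private final SketchBufferPool pool;
    int floatingBuffersHeld;          // exposed only to keep the sketch short
    boolean ignoredDestroyedPool;     // set when the exception was swallowed

    SketchInputChannel(SketchBufferPool pool) {
        this.pool = pool;
    }

    void onSenderBacklog(int backlog) {
        try {
            while (floatingBuffersHeld < backlog) {
                Object buffer = pool.requestBuffer();
                if (buffer == null) {
                    break; // nothing available; a real channel would register as listener
                }
                floatingBuffersHeld++;
            }
        } catch (IllegalStateException e) {
            if (!pool.isDestroyed()) {
                throw e; // unrelated failure: keep propagating it
            }
            // The pool was destroyed while the gate is being released: ignore.
            ignoredDestroyedPool = true;
        }
    }
}
```

Checking {{isDestroyed}} before swallowing keeps the catch narrow, so only 
the shutdown case is silenced.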

I also considered the option 3 you mentioned, which covers the normal case 
while the task is running. Suppose many input channels first register as 
listeners in the buffer pool and later no longer need additional floating 
credits because their exclusive buffers have been recycled. In that case 
{{notifyBufferAvailable}} also triggers many recursive calls and may likewise 
cause a {{StackOverflowError}}. To solve this, the input channel should not 
recycle the floating buffer internally; instead the buffer pool should choose 
another listener to notify. But since {{notifyBufferAvailable}} is called 
outside of the synchronized part, this seems difficult to implement inside 
the buffer pool. So I have not yet thought of a good way to make this change.
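
For illustration only, the idea of letting the pool pick the next listener 
could look like the sketch below. {{SketchListener}}, {{offerBuffer}} and 
{{LoopingPool}} are hypothetical names, the boolean return value is an 
assumed contract rather than Flink's existing one, and the sketch is 
single-threaded, so it deliberately sidesteps the synchronization difficulty 
described above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical listener contract: return true if the buffer was taken,
// false if the listener no longer needs it.
interface SketchListener {
    boolean offerBuffer(Object buffer);
}

// Hypothetical pool that iterates over listeners instead of recursing.
class LoopingPool {
    private final Deque<SketchListener> listeners = new ArrayDeque<>();
    private final Deque<Object> available = new ArrayDeque<>();

    void register(SketchListener listener) {
        listeners.add(listener);
    }

    void recycle(Object buffer) {
        SketchListener listener;
        // A declining listener is dropped and the next one is tried (assumption).
        while ((listener = listeners.poll()) != null) {
            if (listener.offerBuffer(buffer)) {
                return; // buffer handed over, stack depth stays constant
            }
        }
        available.add(buffer); // nobody wanted it: back into the pool
    }

    int availableCount() {
        return available.size();
    }
}
```

Because a declining listener returns to the pool's loop instead of calling 
{{recycle}} itself, the call depth does not grow with the number of listeners.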

Maybe I can submit a PR for option 2 if there are no other good choices.

> Avoid recursion stack overflow during releasing SingleInputGate
> ---------------------------------------------------------------
>
>                 Key: FLINK-10367
>                 URL: https://issues.apache.org/jira/browse/FLINK-10367
>             Project: Flink
>          Issue Type: Improvement
>          Components: Network
>    Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.5.3, 1.6.0
>            Reporter: zhijiang
>            Assignee: zhijiang
>            Priority: Minor
>
> When a task fails or is canceled, {{SingleInputGate#releaseAllResources}} 
> is invoked before the task exits.
> In {{SingleInputGate#releaseAllResources}}, we first loop over all the input 
> channels to release them, then destroy the {{BufferPool}}. 
> {{RemoteInputChannel#releaseAllResources}} returns its floating buffers to 
> the {{BufferPool}}, which assigns each recycled buffer to one of the other 
> listeners ({{RemoteInputChannel}}).
> This process may recurse. If the listener has already been released, it 
> directly recycles the buffer to the {{BufferPool}}, which then takes another 
> listener to notify of the available buffer. This can repeat recursively.
> If many input channels are registered as listeners in the {{BufferPool}}, 
> the recursion causes a {{StackOverflowError}}. In our testing job, a scale 
> of 10,000 input channels has caused this error.
> I can think of two ways to solve this potential problem:
>  # When an input channel is released, it should notify the {{BufferPool}} to 
> unregister it as a listener; otherwise the two are inconsistent.
>  # {{SingleInputGate}} should destroy the {{BufferPool}} first, then loop to 
> release all the internal input channels. This way, all the listeners in the 
> {{BufferPool}} are removed when it is destroyed, and the input channels have 
> no further interactions with it during 
> {{RemoteInputChannel#releaseAllResources}}.
> I prefer the second way, because we do not want to add another interface 
> method for removing a buffer listener; furthermore, the current internal 
> data structure in {{BufferPool}} cannot remove a listener directly.
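
The recursion described in the quoted issue can be modeled with the small 
sketch below. {{RecursivePool}} and {{ReleasedChannel}} are hypothetical 
stand-ins, not Flink classes, and the sketch records the nesting depth with a 
counter instead of actually exhausting the stack.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical pool that notifies the next listener on every recycle.
class RecursivePool {
    private final Deque<ReleasedChannel> listeners = new ArrayDeque<>();
    int depth;     // current nesting of recycle() calls
    int maxDepth;  // deepest nesting observed

    void addListener(ReleasedChannel channel) {
        listeners.add(channel);
    }

    // Destroying the pool drops every registered listener.
    void destroy() {
        listeners.clear();
    }

    void recycle(Object buffer) {
        depth++;
        maxDepth = Math.max(maxDepth, depth);
        ReleasedChannel next = listeners.poll();
        if (next != null) {
            next.notifyBufferAvailable(buffer); // re-enters recycle() below
        }
        depth--;
    }
}

// Hypothetical channel that is already released: it cannot use the buffer,
// so it hands it straight back to the pool.
class ReleasedChannel {
    private final RecursivePool pool;

    ReleasedChannel(RecursivePool pool) {
        this.pool = pool;
    }

    void notifyBufferAvailable(Object buffer) {
        pool.recycle(buffer);
    }
}
```

In this model, recycling one buffer with 1,000 released listeners registered 
nests {{recycle}} 1,001 levels deep, while calling {{destroy}} before the 
recycle keeps the depth at 1.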



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
