[jira] [Commented] (FLINK-13245) Network stack is leaking files

zhijiang (JIRA) Mon, 15 Jul 2019 09:44:08 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-13245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885396#comment-16885396
 ]


zhijiang commented on FLINK-13245:
----------------------------------

[~azagrebin] I had not seen your latest comments before I submitted the above.

I agree with both of your points above. It is inconsistent to remove the reader 
from `allReaders` if it is not in `availableReaders` for a canceled partition. If 
we do not remove it from `allReaders`, it can still be released while handling 
the `CloseRequest` from `RemoteInputChannel`. I think the above modifications 
could solve this issue based on the existing mechanism.
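
A minimal sketch of the bookkeeping described above, written as a standalone model and not as Flink's actual code. The `Reader` and `ReaderRegistry` classes and their method names are hypothetical; only `allReaders` and `availableReaders` come from this discussion. It assumes that cancelation only dequeues the reader, and that the later close request is what removes and releases it:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Hypothetical stand-in for a subpartition view reader. */
class Reader {
    final int receiverId;
    boolean released;

    Reader(int receiverId) {
        this.receiverId = receiverId;
    }

    void releaseAllResources() {
        released = true;
    }
}

/** Simplified model of the reader bookkeeping; not Flink's real class. */
class ReaderRegistry {
    private final Map<Integer, Reader> allReaders = new HashMap<>();
    private final Queue<Reader> availableReaders = new ArrayDeque<>();

    void register(Reader reader) {
        allReaders.put(reader.receiverId, reader);
    }

    void markAvailable(Reader reader) {
        availableReaders.add(reader);
    }

    /** Cancelation only dequeues the reader; it stays in allReaders. */
    void cancel(int receiverId) {
        availableReaders.removeIf(r -> r.receiverId == receiverId);
    }

    /** The close request still finds the reader in allReaders and releases it. */
    void close(int receiverId) {
        Reader reader = allReaders.remove(receiverId);
        if (reader != null) {
            availableReaders.remove(reader);
            reader.releaseAllResources();
        }
    }

    boolean isTracked(int receiverId) {
        return allReaders.containsKey(receiverId);
    }
}
```

In this model a reader that was never in `availableReaders` is still released on close, because `cancel` does not touch `allReaders`.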

Regarding the second problem, I have also found it a bit strange to release a 
partition via a circular call/dependency, that is: network notification -> 
ResultPartition -> ResultPartitionManager -> ResultPartition -> 
ResultSubpartition. I also considered merging the `releaseAllResources` and 
`notifySubpartitionConsumed` methods, but I am not sure we should do this 
refactoring at this point in release-1.9. I think it would be safer and more 
appropriate to do it in the next version, 1.10.
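
To make the call chain above concrete, here is a simplified model of it. All class and method names (`Subpartition`, `PartitionManager`, `Partition`, `onConsumedSubpartition`, `pendingReferences`) are illustrative assumptions and do not reproduce the real Flink classes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical subpartition that records whether it was released. */
class Subpartition {
    boolean released;

    void release() {
        released = true;
    }
}

/** Hypothetical manager: the middle hop of the chain. */
class PartitionManager {
    void onConsumedPartition(Partition partition) {
        // The manager calls straight back into the partition.
        partition.release();
    }
}

/** Simplified partition holding a count of unconsumed subpartitions. */
class Partition {
    private final PartitionManager manager;
    private final List<Subpartition> subpartitions = new ArrayList<>();
    private final AtomicInteger pendingReferences;

    Partition(PartitionManager manager, int numSubpartitions) {
        this.manager = manager;
        this.pendingReferences = new AtomicInteger(numSubpartitions);
        for (int i = 0; i < numSubpartitions; i++) {
            subpartitions.add(new Subpartition());
        }
    }

    /** Entry point, triggered by the network notification. */
    void onConsumedSubpartition() {
        if (pendingReferences.decrementAndGet() == 0) {
            manager.onConsumedPartition(this);
        }
    }

    /** Reached only through the manager; releases every subpartition. */
    void release() {
        for (Subpartition subpartition : subpartitions) {
            subpartition.release();
        }
    }

    boolean isFullyReleased() {
        return subpartitions.stream().allMatch(s -> s.released);
    }
}
```

In this model, if one consumption notification is missed the counter never reaches 0 and `release` is never called, which matches the symptom described in the issue.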

For now, we could fix the potential file leak by reusing the previous approach, 
which requires less work.

> Network stack is leaking files
> ------------------------------
>
>                 Key: FLINK-13245
>                 URL: https://issues.apache.org/jira/browse/FLINK-13245
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.9.0
>            Reporter: Chesnay Schepler
>            Assignee: zhijiang
>            Priority: Blocker
>             Fix For: 1.9.0
>
>
> There's a file leak in the network stack / shuffle service.
> When running the {{SlotCountExceedingParallelismTest}} on Windows a large 
> number of {{.channel}} files continue to reside in a 
> {{flink-netty-shuffle-XXX}} directory.
> From what I've gathered so far these files are still being used by a 
> {{BoundedBlockingSubpartition}}. The cleanup logic in this class uses 
> ref-counting to ensure we don't release data while a reader is still present. 
> However, at the end of the job this count has not reached 0, and thus nothing 
> is being released.
> The same issue is also present on the {{ResultPartition}} level; the 
> {{ReleaseOnConsumptionResultPartition}} also are being released while the 
> ref-count is greater than 0.
> Overall it appears like there's some issue with the notifications for 
> partitions being consumed.
> It is possible that this issue has recently caused problems on Travis, where 
> builds were failing due to a lack of disk space.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
