[
https://issues.apache.org/jira/browse/FLINK-13245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16889022#comment-16889022
]
zhijiang commented on FLINK-13245:
----------------------------------
After confirming the comments from [~Zentol] in the PR, I found that in the
case of `SlotCountExceedingParallelismTest` no
`ReleaseOnConsumptionResultPartition` is generated, because the partition is
of blocking type. So the reference counter is not used in `ResultPartition`,
and the files for the bounded blocking partition are finally released via
`TaskExecutorGateway#releasePartitions`, driven by the
`RegionPartitionReleaseStrategy`.
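To illustrate the release path described above, here is a minimal, hypothetical
sketch (not Flink's actual implementation; `RegionReleaseTracker` and
`onRegionFinished` are illustrative names) of the region-based release idea: a
blocking partition's files are kept until every consuming region has finished,
and only then released in one explicit call, analogous to
`TaskExecutorGateway#releasePartitions`:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Tracks which consumer regions of a blocking partition are still running.
// Only when the last one finishes may the partition's files be released.
class RegionReleaseTracker {
    private final Set<String> unfinishedConsumerRegions;

    RegionReleaseTracker(Set<String> consumerRegions) {
        this.unfinishedConsumerRegions = new HashSet<>(consumerRegions);
    }

    // Returns true when the blocking partition may now be released,
    // i.e. no consumer region remains unfinished.
    boolean onRegionFinished(String regionId) {
        unfinishedConsumerRegions.remove(regionId);
        return unfinishedConsumerRegions.isEmpty();
    }

    public static void main(String[] args) {
        RegionReleaseTracker tracker = new RegionReleaseTracker(
                new HashSet<>(Arrays.asList("region-a", "region-b")));
        System.out.println(tracker.onRegionFinished("region-a")); // false
        System.out.println(tracker.onRegionFinished("region-b")); // true
    }
}
```

The point of this strategy is that the release decision is made externally by
the scheduler, so no per-reader reference counting inside the partition is
needed for the blocking case.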
The description of this JIRA ticket might not be accurate. When I run this
test locally on Mac, there are no file leaks after it finishes. I am not sure
why files leak on Windows; I suspect it is related to the internal mmap
mechanism of the different operating systems. I will verify this test on
Windows again.
My PR modifications seem to apply only to the case of the pipelined partition,
which uses `ReleaseOnConsumptionResultPartition`; there the calls to
`notifySubpartitionConsumed` eventually bring the reference counter down to 0
and trigger the release. For the pipelined partition, however, there is no
issue with persistent files.
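For contrast, the reference-counting path used by the pipelined case can be
sketched as follows. This is a hypothetical simplification, not Flink's actual
class: `RefCountedPartition` and `onSubpartitionConsumed` are illustrative
names mirroring `notifySubpartitionConsumed` in spirit only:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of consumption-triggered release: the counter starts at the number
// of subpartitions, each consumption notification decrements it, and the
// partition is released exactly when the counter reaches zero.
class RefCountedPartition {
    private final AtomicInteger pendingReferences;
    private volatile boolean released = false;

    RefCountedPartition(int numberOfSubpartitions) {
        this.pendingReferences = new AtomicInteger(numberOfSubpartitions);
    }

    // Called once per fully consumed subpartition.
    void onSubpartitionConsumed() {
        if (pendingReferences.decrementAndGet() == 0) {
            release();
        }
    }

    private void release() {
        // Real code would free buffers / delete spilled files here.
        released = true;
    }

    boolean isReleased() {
        return released;
    }

    public static void main(String[] args) {
        RefCountedPartition partition = new RefCountedPartition(2);
        partition.onSubpartitionConsumed();
        System.out.println(partition.isReleased()); // false
        partition.onSubpartitionConsumed();
        System.out.println(partition.isReleased()); // true
    }
}
```

The leak described in this ticket corresponds to the situation where one of
these consumption notifications never arrives, so the counter never reaches
zero and `release()` is never called.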
> Network stack is leaking files
> ------------------------------
>
> Key: FLINK-13245
> URL: https://issues.apache.org/jira/browse/FLINK-13245
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Network
> Affects Versions: 1.9.0
> Reporter: Chesnay Schepler
> Assignee: zhijiang
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.9.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> There's a file leak in the network stack / shuffle service.
> When running the {{SlotCountExceedingParallelismTest}} on Windows a large
> number of {{.channel}} files continue to reside in a
> {{flink-netty-shuffle-XXX}} directory.
> From what I've gathered so far these files are still being used by a
> {{BoundedBlockingSubpartition}}. The cleanup logic in this class uses
> ref-counting to ensure we don't release data while a reader is still present.
> However, at the end of the job this count has not reached 0, and thus nothing
> is being released.
> The same issue is also present at the {{ResultPartition}} level; the
> {{ReleaseOnConsumptionResultPartition}} is also being released while the
> ref-count is greater than 0.
> Overall it appears like there's some issue with the notifications for
> partitions being consumed.
> It is plausible that this issue has recently caused failures on Travis,
> where builds were failing due to a lack of disk space.
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)