[
https://issues.apache.org/jira/browse/FLINK-13245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16883829#comment-16883829
]
Chesnay Schepler edited comment on FLINK-13245 at 7/15/19 10:56 AM:
--------------------------------------------------------------------
With the extensive help of [~azagrebin] we tracked this down to an issue in the
{{PartitionRequestQueue}}, where readers are only properly released if they are
marked as available (i.e., having credit). If a consumer attempted to close a
connection for which the producer had no credit, the actual release was skipped.
This issue only surfaced now because the new {{BoundedBlockingSubpartition}}
only releases its data once all readers have been released, in contrast to the
previous implementation.
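For illustration, here is a minimal, self-contained sketch of the pre-fix behavior described above. The class and method names ({{BuggyQueueSketch}}, {{register}}, {{cancel}}) are hypothetical stand-ins, not Flink APIs; the point is that cancellation only scans the queue of available readers, so a reader without credit is never polled and therefore never released:

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the buggy release path: release only happens for
// readers found in the availableReaders queue (i.e. readers with credit).
class BuggyQueueSketch {
    final ArrayDeque<String> availableReaders = new ArrayDeque<>();
    final Set<String> allReaders = new HashSet<>();
    final Set<String> released = new HashSet<>();

    void register(String reader, boolean hasCredit) {
        allReaders.add(reader);
        if (hasCredit) {
            availableReaders.add(reader);
        }
    }

    // Pre-fix cancellation: only scans the available queue.
    void cancel(String toCancel) {
        int size = availableReaders.size();
        for (int i = 0; i < size; i++) {
            String reader = availableReaders.poll();
            if (reader.equals(toCancel)) {
                allReaders.remove(reader);
                released.add(reader); // release only happens here
            } else {
                availableReaders.add(reader);
            }
        }
        // A reader without credit never appears in availableReaders,
        // so it is silently leaked.
    }
}
```

Cancelling a reader registered without credit leaves it in {{allReaders}} and never marks it as released, which mirrors the leak described above.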
[~zjwang] [~pnowojski] [~NicoK] [~StephanEwen]
Our proposal would be to modify {{PartitionRequestQueue#userEventTriggered}} as
follows:
{code}
// Cancel the request for the input channel
final NetworkSequenceViewReader toRelease = allReaders.remove(toCancel);
if (toRelease == null) {
	return; // reader was never registered or was already released
}

// Remove the reader from the queue of available readers, if present
int size = availableReaders.size();
for (int i = 0; i < size; i++) {
	NetworkSequenceViewReader reader = pollAvailableReader();
	if (reader != toRelease) {
		registerAvailableReader(reader);
	}
}

// Release unconditionally, regardless of whether the reader had credit
toRelease.releaseAllResources();
markAsReleased(toRelease.getReceiverId());
{code}
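The effect of the proposed change can be demonstrated with the same kind of simplified sketch as before (again with hypothetical names, not Flink APIs): the reader is removed from the set of all readers first, the available queue is merely filtered, and the release then happens unconditionally, so even a reader without credit is released:

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the proposed fix: release no longer depends on
// the reader being present in the availableReaders queue.
class FixedQueueSketch {
    final ArrayDeque<String> availableReaders = new ArrayDeque<>();
    final Set<String> allReaders = new HashSet<>();
    final Set<String> released = new HashSet<>();

    void register(String reader, boolean hasCredit) {
        allReaders.add(reader);
        if (hasCredit) {
            availableReaders.add(reader);
        }
    }

    void cancel(String toCancel) {
        // Remove from the full set first; bail out if unknown.
        if (!allReaders.remove(toCancel)) {
            return;
        }
        // Filter the reader out of the available queue, if present.
        int size = availableReaders.size();
        for (int i = 0; i < size; i++) {
            String reader = availableReaders.poll();
            if (!reader.equals(toCancel)) {
                availableReaders.add(reader);
            }
        }
        // Release unconditionally.
        released.add(toCancel);
    }
}
```

With this ordering, a cancelled reader is released whether or not it had credit at the time of cancellation, which is exactly what the leaking path was missing.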
> Network stack is leaking files
> ------------------------------
>
> Key: FLINK-13245
> URL: https://issues.apache.org/jira/browse/FLINK-13245
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Network
> Affects Versions: 1.9.0
> Reporter: Chesnay Schepler
> Assignee: zhijiang
> Priority: Blocker
> Fix For: 1.9.0
>
>
> There's a file leak in the network stack / shuffle service.
> When running the {{SlotCountExceedingParallelismTest}} on Windows, a large
> number of {{.channel}} files remain in a {{flink-netty-shuffle-XXX}}
> directory.
> From what I've gathered so far, these files are still being used by a
> {{BoundedBlockingSubpartition}}. The cleanup logic in this class uses
> ref-counting to ensure we don't release data while a reader is still present.
> However, at the end of the job this count has not reached 0, and thus nothing
> is released.
> The same issue is also present at the {{ResultPartition}} level; the
> {{ReleaseOnConsumptionResultPartition}} is likewise not released while the
> ref-count is greater than 0.
> Overall, it appears that there is some issue with the notifications for
> consumed partitions.
> It is plausible that this issue has recently caused failures on Travis, where
> builds were failing due to a lack of disk space.
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)