[ https://issues.apache.org/jira/browse/FLINK-13245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885832#comment-16885832 ]
zhijiang edited comment on FLINK-13245 at 7/16/19 4:16 AM: -----------------------------------------------------------

We previously assumed that a {{ResultSubpartitionView}} could be released individually, but in fact all subpartitions are released together via {{ResultPartition}}/{{ResultPartitionManager}}. After thinking it through, it seems reasonable to keep both methods, {{ResultSubpartitionView#notifySubpartitionConsumed}} and {{ResultSubpartitionView#releaseAllResources}}, because they describe different semantics.

* {{releaseAllResources}} releases resources from the {{ResultSubpartitionView}} side. The view is created by the netty stack, which is also responsible for triggering its release. In detail, there are two scenarios that trigger the release: one is the netty channel becoming inactive or hitting an exception, as currently done in {{PartitionRequestQueue#channelInactive}} and {{PartitionRequestQueue#exceptionCaught}}; the other is the {{ResultSubpartitionView}} being fully consumed, signalled via {{NettyMessage#CancelPartitionRequest}}.

* {{notifySubpartitionConsumed}} only indicates that the {{ResultSubpartition}}/view has been consumed via {{CancelPartitionRequest}}, so we should call this method while handling the cancel message. A channel exception or inactive channel does not always imply consumption, so we should not call this method there, as {{PartitionRequestQueue}} currently does. It is up to the {{JobMaster}} whether to release the partition in the case of an inactive or failed channel. For a streaming job, if the consumer fails, the {{JobMaster}} would also cancel the producer task to release the whole {{ResultPartition}}. For a batch job with a blocking partition, if the consumer TM exits and causes the channel to become inactive, the {{ResultPartition}} might not need to be released at all.

Overall, these two methods decouple the release of the {{ResultPartition}} from that of the {{ResultSubpartitionView}}.
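The split above could be sketched roughly as follows. This is a simplified, hypothetical illustration, not Flink's actual classes: {{SubpartitionView}}, {{CountingView}}, and {{RequestQueue}} here are minimal stand-ins for {{ResultSubpartitionView}} and {{PartitionRequestQueue}}:

```java
// Hypothetical stand-ins illustrating the proposed split between the
// "consumed" notification and the release of view-side resources.
interface SubpartitionView {
    // Called only when the consumer sends CancelPartitionRequest,
    // i.e. the subpartition really was consumed.
    void notifySubpartitionConsumed();

    // Called whenever the netty-side reader goes away, whether through
    // consumption, channel inactivity, or an exception.
    void releaseAllResources();
}

class CountingView implements SubpartitionView {
    int consumedCalls = 0;
    int releaseCalls = 0;
    public void notifySubpartitionConsumed() { consumedCalls++; }
    public void releaseAllResources() { releaseCalls++; }
}

class RequestQueue {
    // Consumer confirmed consumption: notify first, then release the view.
    static void onCancelPartitionRequest(SubpartitionView view) {
        view.notifySubpartitionConsumed();
        view.releaseAllResources();
    }

    // Channel became inactive or failed: release only the view's resources.
    // Whether the whole ResultPartition is released is left to the JobMaster.
    static void onChannelInactiveOrException(SubpartitionView view) {
        view.releaseAllResources();
    }
}
```

With this shape, only the cancel path carries the consumption semantic; the channel-failure path never pretends the data was consumed.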
So it makes sense to keep both methods as they are now, as long as we handle the {{CancelPartitionRequest}} message correctly based on the modifications above. [~azagrebin]
> Network stack is leaking files
> ------------------------------
>
>                 Key: FLINK-13245
>                 URL: https://issues.apache.org/jira/browse/FLINK-13245
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.9.0
>            Reporter: Chesnay Schepler
>            Assignee: zhijiang
>            Priority: Blocker
>             Fix For: 1.9.0
>
> There's a file leak in the network stack / shuffle service.
> When running the {{SlotCountExceedingParallelismTest}} on Windows, a large number of {{.channel}} files remain in a {{flink-netty-shuffle-XXX}} directory.
> From what I've gathered so far, these files are still being used by a {{BoundedBlockingSubpartition}}. The cleanup logic in this class uses ref-counting to ensure we don't release data while a reader is still present. However, at the end of the job this count has not reached 0, and thus nothing is released.
> The same issue is also present at the {{ResultPartition}} level; the {{ReleaseOnConsumptionResultPartition}} instances are likewise not released while the ref-count is greater than 0.
> Overall it appears there is some issue with the notifications for partitions being consumed.
> It is plausible that this issue has recently caused builds to fail on Travis due to a lack of disk space.

-- This message was sent by Atlassian JIRA (v7.6.14#76016)