[ https://issues.apache.org/jira/browse/FLINK-13245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885832#comment-16885832 ]
zhijiang edited comment on FLINK-13245 at 7/16/19 4:16 AM: -----------------------------------------------------------

We previously assumed that a {{ResultSubpartitionView}} could be released individually, but in fact all subpartitions are released together via {{ResultPartition}}/{{ResultPartitionManager}}. After thinking it through, it seems reasonable to keep both methods, {{ResultSubpartitionView#notifySubpartitionConsumed}} and {{ResultSubpartitionView#releaseAllResources}}, because they describe different semantics.

* {{releaseAllResources}} releases resources from the {{ResultSubpartitionView}} side. The view is created by the netty stack, which is also responsible for triggering its release. In detail, there are two scenarios that trigger the release: one is the netty channel becoming inactive or hitting an exception, as currently done in {{PartitionRequestQueue#channelInactive}} and {{PartitionRequestQueue#exceptionCaught}}; the other is the {{ResultSubpartitionView}} being fully consumed, signalled via {{NettyMessage#CancelPartitionRequest}}.

* {{notifySubpartitionConsumed}} only indicates that the {{ResultSubpartition}}/view has been consumed via {{CancelPartitionRequest}}, so we should call this method while handling the cancel message. A channel exception or inactive channel does not always imply consumption, so we should not call this method there, as {{PartitionRequestQueue}} currently does. It is up to the {{JobMaster}} whether to release the partition in the case of an inactive or failed channel. For a streaming job, if the consumer fails, the {{JobMaster}} would also cancel the producer task to release the whole {{ResultPartition}}. For a batch job with a blocking partition, if the consumer TM exits and causes the channel to become inactive, the {{ResultPartition}} might not need to be released at all.

Overall, these two methods decouple the release of the {{ResultPartition}} from that of the {{ResultSubpartitionView}}.
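The split above could be sketched roughly as follows. This is a simplified, hypothetical illustration, not Flink's actual classes: {{SubpartitionView}}, {{CountingView}}, and {{RequestQueue}} here are minimal stand-ins for {{ResultSubpartitionView}} and {{PartitionRequestQueue}}:

```java
// Hypothetical stand-ins illustrating the proposed split between the
// "consumed" notification and the release of view-side resources.
interface SubpartitionView {
    // Called only when the consumer sends CancelPartitionRequest,
    // i.e. the subpartition really was consumed.
    void notifySubpartitionConsumed();

    // Called whenever the netty-side reader goes away, whether through
    // consumption, channel inactivity, or an exception.
    void releaseAllResources();
}

class CountingView implements SubpartitionView {
    int consumedCalls = 0;
    int releaseCalls = 0;
    public void notifySubpartitionConsumed() { consumedCalls++; }
    public void releaseAllResources() { releaseCalls++; }
}

class RequestQueue {
    // Consumer confirmed consumption: notify first, then release the view.
    static void onCancelPartitionRequest(SubpartitionView view) {
        view.notifySubpartitionConsumed();
        view.releaseAllResources();
    }

    // Channel became inactive or failed: release only the view's resources.
    // Whether the whole ResultPartition is released is left to the JobMaster.
    static void onChannelInactiveOrException(SubpartitionView view) {
        view.releaseAllResources();
    }
}
```

With this shape, only the cancel path carries the consumption semantic; the channel-failure path never pretends the data was consumed.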
So it makes sense to keep both methods as they are now, as long as we handle the {{CancelPartitionRequest}} message correctly based on the modifications above. [~azagrebin]
> Network stack is leaking files
> ------------------------------
>
>                 Key: FLINK-13245
>                 URL: https://issues.apache.org/jira/browse/FLINK-13245
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.9.0
>            Reporter: Chesnay Schepler
>            Assignee: zhijiang
>            Priority: Blocker
>             Fix For: 1.9.0
>
> There's a file leak in the network stack / shuffle service.
> When running the {{SlotCountExceedingParallelismTest}} on Windows, a large number of {{.channel}} files remain in a {{flink-netty-shuffle-XXX}} directory.
> From what I've gathered so far, these files are still being used by a {{BoundedBlockingSubpartition}}. The cleanup logic in this class uses ref-counting to ensure we don't release data while a reader is still present. However, at the end of the job this count has not reached 0, and thus nothing is released.
> The same issue is also present at the {{ResultPartition}} level; the {{ReleaseOnConsumptionResultPartition}} instances are likewise not released while the ref-count is greater than 0.
> Overall it appears there is some issue with the notifications for partitions being consumed.
> It is plausible that this issue has recently caused builds to fail on Travis due to a lack of disk space.

-- This message was sent by Atlassian JIRA (v7.6.14#76016)