[ 
https://issues.apache.org/jira/browse/FLINK-32298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-32298:
-----------------------------------
    Labels: pull-request-available  (was: )

> The outputQueueSize is negative 
> --------------------------------
>
>                 Key: FLINK-32298
>                 URL: https://issues.apache.org/jira/browse/FLINK-32298
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.18.0
>            Reporter: Rui Fan
>            Assignee: Rui Fan
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2023-06-09-17-27-46-429.png
>
>
> h1. Backgraound
> The outputQueueSize indicates `The real size of queued output buffers in 
> bytes.`, so it shouldn't be negative. However, it may be negative in some 
> cases.
> h2. How outputQueueSize is generated?
> TotalWrittenBytes: *_BufferWritingResultPartition#totalWrittenBytes_* records 
> how many data is written to ResultPartition.
> TotalSentNumberOfBytes: *_PipelinedSubpartition#totalNumberOfBytes_* records 
> how many data is sent to downstream.
> The outputQueueSize = TotalWrittenBytes - TotalSentNumberOfBytes.
> h1. Bug
> The TotalSentNumberOfBytes may be larger than TotalWrittenBytes due to some 
> data are written to the PipelinedSubpartition without the 
> BufferWritingResultPartition, such as : 
>  # PipelinedSubpartition#finishReadRecoveredState writes the 
> `EndOfChannelStateEvent` even if the unaligned checkpoint is disable
>  # PipelinedSubpartition#addRecovered writes channel state(if the job 
> recovered from unaligned checkpoint, the outputQueueSize is totally wrong)
>  # PipelinedSubpartition#finish writes the `EndOfPartitionEvent`
> !image-2023-06-09-17-27-46-429.png|width=1033,height=296!
>  
> h1. Solution
> PipelinedSubpartition should is written through BufferWritingResultPartition, 
> and all writes should be counted.
>  
> By the way, outputQueueSize doesn't matter because it's just a metric, it 
> doesn't affect data processing. I found this bug because some of our flink 
> scenarios need to use adaptive rebalance (FLINK-31655), I'm developing it in 
> our internal version, which relies on the correct outputQueueSize to select 
> the low pressure channel.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to