[ 
https://issues.apache.org/jira/browse/FLINK-12576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16938397#comment-16938397
 ] 

David Anderson commented on FLINK-12576:
----------------------------------------

> Input queue length of both the local and remote channel are not always zero. 
> Did I do something wrong?

No, you did nothing wrong: that's what I see as well. See the screenshots I 
posted showing all of the input metrics in various cases. With 2 single-slot 
TMs, both channels have input queue length that's not always zero. However, 
inPoolUsage is always zero for one of the channels, which I believe is wrong. 

In the case of a single two-slot TM, both channels are local, and both show 
an input queue length (and all other input metrics) that is always zero, which 
is definitely confusing, if not wrong.

If the current behavior is somehow considered "correct" then the documentation 
needs to be updated to explain which of these metrics don't work in the local 
case -- or better, the metrics should be renamed to make it clear what they are 
actually measuring.


> inputQueueLength metric does not work for LocalInputChannels
> ------------------------------------------------------------
>
>                 Key: FLINK-12576
>                 URL: https://issues.apache.org/jira/browse/FLINK-12576
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Metrics, Runtime / Network
>    Affects Versions: 1.6.4, 1.7.2, 1.8.0, 1.9.0
>            Reporter: Piotr Nowojski
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.9.0
>
>         Attachments: Screen Shot 2019-09-24 at 3.11.15 PM.png, Screen Shot 
> 2019-09-24 at 3.13.05 PM.png, Screen Shot 2019-09-24 at 3.22.36 PM.png, 
> Screen Shot 2019-09-24 at 3.22.53 PM.png, 
> flink-1.8-2-single-slot-TMs-input.png, 
> flink-1.8-2-single-slot-TMs-output.png, flink-1.8-input-subtasks.png, 
> flink-1.8-output-subtasks.png, image-2019-09-26-11-34-24-878.png, 
> image-2019-09-26-11-36-06-027.png
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently {{inputQueueLength}} ignores LocalInputChannels 
> ({{SingleInputGate#getNumberOfQueuedBuffers}}). This can cause mistakes 
> when looking for the causes of back pressure (for example, if a task is 
> back pressuring the whole Flink job, but there is data skew and only local 
> input channels are being used).
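
To illustrate the effect described in the issue, here is a minimal sketch in Java. It is a simplified model, not Flink's actual code: the names QueueLengthSketch, Channel, queueLengthRemoteOnly and queueLengthAllChannels are hypothetical and only stand in for the idea of summing queued buffers across the channels of an input gate. It assumes each channel can report a number of queued buffers, and it compares a sum that skips local channels with one that includes them.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model. These are NOT Flink's real classes.
public class QueueLengthSketch {

    /** Stand-in for an input channel with a count of buffers waiting to be read. */
    static final class Channel {
        final boolean local;       // true if producer and consumer share a TaskManager
        final int queuedBuffers;   // buffers currently queued on this channel

        Channel(boolean local, int queuedBuffers) {
            this.local = local;
            this.queuedBuffers = queuedBuffers;
        }
    }

    /** Models the reported behavior: local channels are skipped in the sum. */
    static int queueLengthRemoteOnly(List<Channel> channels) {
        int total = 0;
        for (Channel c : channels) {
            if (!c.local) {
                total += c.queuedBuffers;
            }
        }
        return total;
    }

    /** One possible alternative (an assumption): every channel contributes. */
    static int queueLengthAllChannels(List<Channel> channels) {
        int total = 0;
        for (Channel c : channels) {
            total += c.queuedBuffers;
        }
        return total;
    }

    public static void main(String[] args) {
        // Both channels local, as with a single two-slot TM.
        List<Channel> onlyLocal = Arrays.asList(new Channel(true, 5), new Channel(true, 7));
        System.out.println(queueLengthRemoteOnly(onlyLocal));   // prints 0
        System.out.println(queueLengthAllChannels(onlyLocal));  // prints 12
    }
}
```

In this model, a gate fed only by local channels always reports a queue length of zero under the remote-only sum, even though buffers are queued, which matches the kind of misleading reading the issue warns about when diagnosing back pressure.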



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
