[
https://issues.apache.org/jira/browse/FLINK-12576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Anderson reopened FLINK-12576:
------------------------------------
To reproduce,
git clone --branch backpressure-with-2-TMs
https://github.com/alpinegizmo/flink-playgrounds.git
cd flink-playgrounds/operations-playground
docker-compose build
docker-compose up -d
You will find a job with these 5 operators
(1) kafka -> (2) timestamps / watermarks -> (3) keyBy + backpressure map -> (4)
keyBy + window -> (5) kafka
where #3, the backpressure map, causes severe backpressure every other minute.
The job is running with parallelism of 2 throughout; up until the first keyBy
all the traffic is on the subtasks with 0 as their index.
In this backpressure-with-2-TMs branch there are two TMs each with one slot.
You will observe that all of the output metrics for the 0-index watermarking
subtask rise to 1 during the even-numbered minutes, and fall to 0 during the
odd numbered minutes, as expected.
If I run this with one TM with 2 slots, all of the input metrics for the
backpressure operator are always zero.
In this backpressure-with-2-TMs branch there are 2 single-slot TMs, and here
the input metrics for subtask 0 of the backpressure operator are always 0, but
the input metrics for subtask 1 of that operator rise and fall every minute, as
they should. Thus my conclusion that the local input metrics are still broken.
> inputQueueLength metric does not work for LocalInputChannels
> ------------------------------------------------------------
>
> Key: FLINK-12576
> URL: https://issues.apache.org/jira/browse/FLINK-12576
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Metrics, Runtime / Network
> Affects Versions: 1.6.4, 1.7.2, 1.8.0
> Reporter: Piotr Nowojski
> Assignee: Aitozi
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.9.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Currently {{inputQueueLength}} ignores LocalInputChannels
> ({{SingleInputGate#getNumberOfQueuedBuffers}}). This can can cause mistakes
> when looking for causes of back pressure (If task is back pressuring whole
> Flink job, but there is a data skew and only local input channels are being
> used).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)