Thanks Emre for driving this proposal! It's very useful for troubleshooting.
I have a question: number_of_records_received_by_each_subtask is the total number of records received, right?

I'm wondering whether we should check data skew based on the most recent time period instead. In production, I have found that the total received records can be balanced across all subtasks while the records received within each time period are skewed.

For example, a Flink job has a `group by` or `keyBy` on an hour field. This means:

- From 0 to 1 o'clock, subtaskA is busy and the rest of the subtasks are idle.
- From 1 to 2 o'clock, subtaskB is busy and the rest of the subtasks are idle.
- In the next hour, the busy subtask changes again.

Looking forward to your opinions~

Best,
Rui

On Tue, Jan 16, 2024 at 2:05 PM Xuyang <xyzhong...@163.com> wrote:

> Hi, Emre.
>
> In large-scale production jobs, data skew often occurs. Having a metric
> on the UI that reflects data skew without the need for manual inspection
> of each vertex by clicking on them would be quite cool. This could help
> users quickly identify problematic nodes, simplifying development and
> operations.
>
> I'm mainly curious about two minor points:
> 1. How will the colors of vertices with high data skew scores be unified
> with the existing backpressure and high busyness colors on the UI? Users
> should be able to distinguish at a glance which vertices in the entire
> job graph are skewed.
> 2. Could you tell me whether you prefer to unify the Data Skew Score with
> the Exception tab? In my opinion, Data Skew Score is in the same category
> as the existing Backpressured and Busy metrics.
>
> Looking forward to your reply.
>
> --
> Best!
> Xuyang
>
> At 2024-01-16 00:59:57, "Kartoglu, Emre" <kar...@amazon.co.uk.INVALID>
> wrote:
> >Hello,
> >
> >I’m opening this thread to discuss a FLIP[1] to make data skew more
> >visible on Flink Dashboard.
> >
> >Data skew is currently not as visible as it should be. Users have to
> >click each operator and check how much data each sub-task is processing
> >and compare the sub-tasks against each other. This is especially
> >cumbersome and error-prone for jobs with big job graphs and high
> >parallelism. I’m proposing this FLIP to improve this.
> >
> >Kind regards,
> >Emre
> >
> >[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-418%3A+Show+data+skew+score+on+Flink+Dashboard
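
P.S. To make the hourly `keyBy` scenario at the top of this message concrete, here is a minimal sketch. The record counts are made up, and `skew_score` is a hypothetical measure ((max - mean) / max) chosen only for illustration; it is not necessarily the formula proposed in the FLIP.

```python
from statistics import mean


def skew_score(records_per_subtask):
    """Hypothetical skew score: 0.0 means perfectly balanced subtasks.

    Assumption for illustration only: (max - mean) / max.
    """
    peak = max(records_per_subtask)
    if peak == 0:
        return 0.0  # no records at all, so nothing is skewed
    return (peak - mean(records_per_subtask)) / peak


# Made-up records received by subtasks A, B, C in three consecutive
# one-hour windows, for a job keyed by an hour field.
hourly_counts = [
    [300, 0, 0],  # 0-1 o'clock: only subtaskA is busy
    [0, 300, 0],  # 1-2 o'clock: only subtaskB is busy
    [0, 0, 300],  # 2-3 o'clock: only subtaskC is busy
]

# Score computed from the totals since job start.
totals = [sum(per_subtask) for per_subtask in zip(*hourly_counts)]
cumulative_score = skew_score(totals)

# Score computed separately for each one-hour window.
window_scores = [skew_score(window) for window in hourly_counts]

print("cumulative:", cumulative_score)
print("per window:", window_scores)
```

With these numbers the totals are identical across subtasks, so a score based on cumulative counts reports no skew, while every individual window is heavily skewed.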