Thanks Emre for driving this proposal!

It's very useful for troubleshooting.

I have a question:

The number_of_records_received_by_each_subtask is the
total received records, right?

I'm not sure whether we should check data skew based on
the latest duration period.

In the production, I found the the total received records of
all subtasks is balanced, but in the each time period, they
are skew.

For example, a flink job has `group by` or `keyBy` based on
hour field. It mean:
- In the 0-1 o'clock, subtaskA is busy, the rest of subtasks are idle.
- In the 1-2 o'clock, subtaskB is busy, the rest of subtasks are idle.
- Next hour, the busy subtask is changed.

Looking forward to your opinions~

Best,
Rui

On Tue, Jan 16, 2024 at 2:05 PM Xuyang <xyzhong...@163.com> wrote:

> Hi, Emre.
>
>
> In large-scale production jobs, the phenomenon of data skew often occurs.
> Having an metric on the UI that
> reflects data skew without the need for manual inspection of each vertex
> by clicking on them would be quite cool.
> This could help users quickly identify problematic nodes, simplifying
> development and operations.
>
>
> I'm mainly curious about two minor points:
> 1. How will the colors of vertics with high data skew scores be unified
> with existing backpressure and high busyness
> colors on the UI? Users should be able to distinguish at a glance which
> vertics in the entire job graph is skewed.
> 2. Can you tell me that you prefer to unify Data Skew Score and Exception
> tab? In my opinion, Data Skew Score is in
> the same category as the existing Backpressured and Busy metrics.
>
>
> Looking forward to your reply.
>
>
>
> --
>
>     Best!
>     Xuyang
>
>
>
>
>
> At 2024-01-16 00:59:57, "Kartoglu, Emre" <kar...@amazon.co.uk.INVALID>
> wrote:
> >Hello,
> >
> >I’m opening this thread to discuss a FLIP[1] to make data skew more
> visible on Flink Dashboard.
> >
> >Data skew is currently not as visible as it should be. Users have to
> click each operator and check how much data each sub-task is processing and
> compare the sub-tasks against each other. This is especially cumbersome and
> error-prone for jobs with big job graphs and high parallelism. I’m
> proposing this FLIP to improve this.
> >
> >Kind regards,
> >Emre
> >
> >[1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-418%3A+Show+data+skew+score+on+Flink+Dashboard
> >
> >
> >
>

Reply via email to