Re: [DISCUSS] FLIP-418: Show data skew score on Flink Dashboard

Rui Fan Wed, 31 Jan 2024 19:26:00 -0800

> I was thinking about using the existing numRecordsInPerSecond metric

Using numRecordsInPerSecond makes sense to me, and I think
it's necessary to mention it in the FLIP wiki. It will help other
developers understand the design easily. WDYT?


BTW, that's why I asked whether the data skew score is based on the
total received records.
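
To make the two readings concrete, here is a minimal sketch of deriving a
"current" per-subtask record count from cumulative counters, i.e. total
received now minus total received one window earlier. The function name
and the sample numbers are hypothetical and only for illustration; this
does not use any Flink API, and the FLIP may well rely on
numRecordsInPerSecond directly instead.

```python
def window_deltas(totals_now, totals_earlier):
    """Per-subtask records received between two snapshots of cumulative counters."""
    if len(totals_now) != len(totals_earlier):
        raise ValueError("both snapshots must cover the same subtasks")
    return [now - earlier for now, earlier in zip(totals_now, totals_earlier)]


# Hypothetical cumulative record counts for 5 subtasks (assumed values).
totals_one_minute_ago = [1000, 1000, 1000, 1000, 1000]
totals_now = [1010, 1010, 1010, 1100, 1010]

current_window = window_deltas(totals_now, totals_one_minute_ago)
print(current_window)  # [10, 10, 10, 100, 10]
```

A score computed from the totals would hide this recent imbalance, while
a score computed from the window deltas would show it.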

> this would always give you a score higher than 1, with no way to cap the
> score.

Yeah, you are right. max/mean is not a score, it's the data skew multiple.
And I guess max/mean is easier to understand than
Average_absolute_deviation.
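
For comparison, a minimal sketch of both candidate metrics on the example
from earlier in this thread ([10, 10, 10, 100, 10]). The max/mean multiple
and the average absolute deviation follow their standard definitions. The
percentage normalisation in skew_percent (dividing by the deviation of the
worst case, where one subtask receives everything) is an assumption made
for illustration and is not necessarily the formula in the FLIP. All
function names are hypothetical.

```python
def skew_multiple(records):
    """max / mean: 1.0 means perfectly balanced, larger means more skew."""
    mean = sum(records) / len(records)
    if mean == 0:
        return 1.0  # no records at all: treat as balanced
    return max(records) / mean


def average_absolute_deviation(records):
    """Mean of the absolute distances from the mean."""
    mean = sum(records) / len(records)
    return sum(abs(r - mean) for r in records) / len(records)


def skew_percent(records):
    """Average absolute deviation scaled to 0..100 (assumed normalisation)."""
    n = len(records)
    mean = sum(records) / n
    if n < 2 or mean == 0:
        return 0.0
    # Deviation when a single subtask receives all records: 2 * mean * (n - 1) / n
    worst_case = 2 * mean * (n - 1) / n
    return 100 * average_absolute_deviation(records) / worst_case


sample = [10, 10, 10, 100, 10]
print(f"{skew_multiple(sample):.2f}")  # 3.57
print(f"{skew_percent(sample):.1f}")   # 64.3
```

Note that for non-negative record counts max/mean cannot exceed the
number of subtasks, so the multiple is bounded by the parallelism.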

> I'm more used to working with percentages. The problem with the max/mean
> metric is I wouldn't immediately know whether a score of 300 is bad for
> instance.
> Whereas if users saw above 50% as suggested in the FLIP for instance,
> they would consider taking action. I'm tempted to push back on this
> suggestion. Happy to discuss further, there is a chance I'm not seeing the
> downside of the proposed percentage based metric yet. Please let me know.

After reading the FLIP and Average_absolute_deviation in detail, I
understand that 0% is the best and 100% is the worst.

I guess it is difficult for users who have not read the documentation to
know what 50% means. Ideally, the data skew metric should be easy for
users to understand without having to read or learn a lot of
background.

For example, as you mentioned before, Flink has a metric:
numRecordsInPerSecond.
I believe users know what numRecordsInPerSecond means even if they
haven't read any documentation.

Of course, I'm open to it. I may have missed something. I'd like to
hear more feedback from the community.

Best,
Rui

On Thu, Feb 1, 2024 at 4:13 AM Kartoglu, Emre <kar...@amazon.co.uk.invalid>
wrote:

> Hi Rui,
>
> " and provide the total and current score in the detailed tab. I didn't
> see the detailed design in the FLIP, would you mind
> improve the design doc? Thanks".
>
> It will essentially be a basic list view similar to the "Checkpoints" tab.
> I only briefly mentioned this in the FLIP because it will be a basic list
> view.
> No problem though, I will update the FLIP.
>
>
> Please find my responses below quotations.
>
> " 1. About the current skew score, I still don't understand how to get
> the list_of_number_of_records_received_by_each_subtask for
> each subtask.
>
> the list_of_number_of_records_received_by_each_subtask of subtask1
> is
>
> total received records of subtask 1 from beginning to now -
> total received records of subtask 1 from beginning to (now - 1min), right?"
>
> Yes, essentially correct. I was thinking about using the existing
> numRecordsInPerSecond metric (see
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/),
> this would give us per second granularity and this would be more
> "current/live" than per minute.
>
>
> "IIUC, you proposed score is between 0% to 100%, and 0% is the best.
> And the 100% is the worst."
>
> Correct.
>
>
> " For data skew, I'm not sure whether a multiple value is more intuitive.
> It means data skew score = max / mean.
>  The data skew score is between 1 and infinity. 1 is the best, and
> the bigger the worse."
>
> I'm not sure I follow you here. Yes, this would always give you a score
> higher than 1, with no way to cap the score.
> I'm more used to working with percentages. The problem with the max/mean
> metric is I wouldn't immediately know whether a score of 300 is bad for
> instance.
> Whereas if users saw above 50% as suggested in the FLIP for instance, they
> would consider taking action. I'm tempted to push back on this suggestion.
> Happy to discuss further, there is a chance I'm not seeing the downside of
> the proposed percentage based metric yet. Please let me know.
>
> Kind regards,
> Emre
>
> On 31/01/2024, 10:57, "Rui Fan" <1996fan...@gmail.com> wrote:
>
>
> Sorry for the late reply.
>
>
>
>
> > So you would have a high data skew while 1 subtask is receiving all the
> > data, but on average (say over 1-2 days) data skew would come down to 0
> > because all subtasks would have received their portion of the data.
> > I'm inclined to think that the current proposal might still be fair, as
> > you do indeed have a skew by definition (but an intentional one). We can
> > have a few ways forward:
> >
> > 0) We can keep the behaviour as proposed. My thoughts are that data skew
> > is data skew, however intentional it may be. It is not necessarily bad,
> > like in your example.
>
>
> It makes sense to me. Flink should show data skew correctly
> regardless of whether the user is intentional or not.
>
>
>
>
> > 1) Show data skew based on the beginning of time (not a live/current
> > score).
> > I mentioned some downsides to this in the FLIP: If you break or fix your
> > data skew recently, the historical data might hide the recent fix/breakage,
> > and it is inconsistent with the other metrics shown on the vertices e.g.
> > Backpressure/Busy metrics show the live/current score.
> >
> > 2) We can choose not to put data skew score on the vertices on the job
> > graph. And instead just use the new proposed Data Skew tab which could show
> > live/current skew score and the total data skew score from the beginning of
> > job.
>
>
> It makes sense, we can show the current skew score in the DAG WebUI by
> default,
> and provide the total and current score in the detailed tab.
>
>
> I didn't see the detailed design in the FLIP, would you mind
> improve the design doc? Thanks
>
>
> Also, I have 2 questions for now:
>
>
> 1. About the current skew score, I still don't understand how to get
> the list_of_number_of_records_received_by_each_subtask for
> each subtask.
>
>
> the list_of_number_of_records_received_by_each_subtask of subtask1
> is total received records of subtask 1 from beginning to now -
> total received records of subtask 1 from beginning to (now - 1min), right?
>
>
> Note: 1min is an example. 30s or 2min is fine for me.
>
>
> 2. The skew score is percent
>
>
> I'm not sure whether the score shown in percent format is reasonable.
> For busy ratio or backpressure ratio, they are shown in percent format
> is intuitive.
>
>
> IIUC, you proposed score is between 0% to 100%, and 0% is the best.
> And the 100% is the worst.
>
>
> For data skew, I'm not sure whether a multiple value is more intuitive.
> It means data skew score = max / mean.
>
>
> For example, we have 5 subtasks, the received record numbers are
> [10,10, 10, 100, 10].
> data skew score = max / mean = 100 / (140/5) = 100/ 28 = 3.57.
>
>
> The data skew score is between 1 and infinity. 1 is the best, and
> the bigger the worse.
>
>
> Looking forward to your opinions.
>
>
> Best,
> Rui
>
>
> On Tue, Jan 23, 2024 at 6:41 PM Kartoglu, Emre <kar...@amazon.co.uk.invalid>
> wrote:
>
>
> > Hi Krzysztof,
> >
> > Thank you for the feedback! Please find my comments below.
> >
> > 1. Configurability
> >
> > Adding a feature flag / configuration to enable this is still on the table
> > as far as I am concerned. However I believe adding a new metric shouldn't
> > warrant a flag/configuration. One might argue that we should have it for
> > showing the metrics on the Flink UI, and I'd appreciate input on this. My
> > default position is to not have a configuration/flag unless there is a good
> > reason (e.g. it turns out there is impact on Flink UI for so far unknown
> > reason). This is because the proposed change should only be improving the
> > experience without any unwanted side effect.
> >
> > 2. Metrics
> >
> > I agree the new metrics should be compatible with the rest of the Flink
> > metric reporting mechanism. I will update the FLIP and propose names for
> > the metrics.
> >
> > Kind regards,
> > Emre
> >
> > On 23/01/2024, 10:31, "Krzysztof Dziołak" <kdzio...@live.com> wrote:
> >
> >
> > Hi Emre,
> >
> >
> > Thank you for driving this proposal. I've got two questions about the
> > extensions to the proposal that are not captured in the FLIP.
> >
> >
> >
> >
> > 1. Configurability - what kind of configuration would you propose to
> > maintain for this feature? Would On/off switch and/or aggregated period
> > length be configurable? Should we capture the toggles in the FLIP ?
> > 2. Metrics - are we planning to emit the skew metric via metric reporters
> > mechanism. Should we capture proposed metric schema in the FLIP ?
> >
> >
> > Kind regards,
> > Krzysztof
> >
> >
> > ________________________________
> > From: Kartoglu, Emre <kar...@amazon.co.uk.invalid>
> > Sent: Monday, January 15, 2024 4:59 PM
> > To: dev@flink.apache.org
> > Subject: [DISCUSS] FLIP-418: Show data skew score on Flink Dashboard
> >
> >
> > Hello,
> >
> >
> > I’m opening this thread to discuss a FLIP[1] to make data skew more
> > visible on Flink Dashboard.
> >
> >
> > Data skew is currently not as visible as it should be. Users have to click
> > each operator and check how much data each sub-task is processing and
> > compare the sub-tasks against each other. This is especially cumbersome and
> > error-prone for jobs with big job graphs and high parallelism. I’m
> > proposing this FLIP to improve this.
> >
> >
> > Kind regards,
> > Emre
> >
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-418%3A+Show+data+skew+score+on+Flink+Dashboard
