Re: [DISCUSS] FLIP-418: Show data skew score on Flink Dashboard

Rui Fan Wed, 07 Feb 2024 00:32:20 -0800

Thanks Emre for the feedback!

I still think max/mean is simpler and easier for users to understand,
but I don’t have a strong opinion about it.
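
To make the comparison concrete, here is a minimal sketch of both
candidates applied to the example quoted further down
([10, 10, 10, 100, 10]). The function names are made up for illustration,
and the percentage variant assumes that the average absolute deviation is
normalized by its largest possible value (all records landing on a single
subtask); the exact formula in the FLIP may differ.

```python
from statistics import mean


def skew_multiple(records):
    """max / mean: 1.0 means perfectly balanced, no upper bound."""
    avg = mean(records)
    if avg == 0:
        # No records received at all: treat as balanced.
        return 1.0
    return max(records) / avg


def skew_percent(records):
    """Average absolute deviation scaled to 0..100 (assumed normalization).

    0 means every subtask received the same number of records,
    100 means a single subtask received everything.
    """
    n = len(records)
    avg = mean(records)
    if n < 2 or avg == 0:
        return 0.0
    # Average absolute deviation from the mean.
    aad = sum(abs(r - avg) for r in records) / n
    # Largest possible value of aad for n non-negative counts with this
    # mean: one subtask holds n * avg records and the others hold 0.
    worst = 2 * avg * (n - 1) / n
    return 100.0 * aad / worst


if __name__ == "__main__":
    received = [10, 10, 10, 100, 10]
    print(f"max/mean multiple: {skew_multiple(received):.2f}")
    print(f"percentage score:  {skew_percent(received):.1f}%")
```

On this input, max/mean gives about 3.57 and the percentage sketch gives
about 64%.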


This proposal is definitely useful for Flink users! To make sure it
delivers value for users, would you mind if we wait a while to see
whether there is more feedback from the community?
Also, would you mind sharing these 2 solutions on the
user[1] & user-zh[2] mailing lists as well? Flink users may give some
valuable feedback there, thanks~

[1] u...@flink.apache.org
[2] user...@flink.apache.org

Best,
Rui

On Thu, Feb 1, 2024 at 5:52 PM Kartoglu, Emre <kar...@amazon.co.uk.invalid>
wrote:

> Hi Rui,
>
> Thanks for the useful feedback and caring about the user experience.
> I will update the FLIP based on one of your comments. I consider this a minor update.
>
> Please find my detailed responses below.
>
> "numRecordsInPerSecond makes sense to me, and I think
> it's necessary to mention it in the FLIP wiki. It will help other developers
> understand it easily. WDYT?"
>
> I feel like this might be touching implementation details. No objections
> though, I will update the FLIP with this as one of the ways in which we can
> achieve the proposal.
>
>
> "After reading the FLIP and Average_absolute_deviation in detail, I see that
> 0% is the best and 100% is the worst."
>
> Correct.
>
>
> "I guess it is difficult for users who have not read the documentation to
> know the meaning of 50%. We hope that the designed data skew score will
> be easy for users to understand without reading documentation or learning
> background material."
>
> I think I understand where you're coming from. My thought is that the user
> won't have to know exactly how the skew percentage/score is calculated. But
> this score will act as a warning sign for them. Upon seeing a skew score of
> 80% for an operator, as a user I will go and click on the operator and see
> that many of my subtasks are not receiving any data at all.
> So it acts as a metric that draws the user's attention to the skewed operator
> so they can fix the issue.
>
>
> "For example, as you mentioned before, flink has a metric:
> numRecordsInPerSecond.
> I believe users know what numRecordsInPerSecond means even if they
> didn't read any documentation."
>
> The FLIP suggests that we will provide an explanation of the data skew score
> under the proposed Data Skew tab. I would like the exact wording to be left
> to the code review process to prevent it from blocking the implementation
> work/progress.
> This will be a user-friendly explanation with an option for the curious
> user to see the exact formula.
>
>
> Kind regards,
> Emre
>
>
> On 01/02/2024, 03:26, "Rui Fan" <1996fan...@gmail.com> wrote:
>
> > I was thinking about using the existing numRecordsInPerSecond metric
>
>
> numRecordsInPerSecond makes sense to me, and I think
> it's necessary to mention it in the FLIP wiki. It will help other developers
> understand it easily. WDYT?
>
>
> BTW, that's why I asked whether the data skew score is based on the total
> number of received records.
>
>
> > this would always give you a score higher than 1, with no way to cap the
> > score.
>
>
> Yeah, you are right. max/mean is not a score, it's the data skew multiple.
> And I guess max/mean is easier to understand than
> Average_absolute_deviation.
>
>
> > I'm more used to working with percentages. The problem with the max/mean
> > metric is I wouldn't immediately know whether a score of 300 is bad for
> > instance.
> > Whereas if users saw above 50% as suggested in the FLIP for instance,
> > they would consider taking action. I'm tempted to push back on this
> > suggestion. Happy to discuss further, there is a chance I'm not seeing the
> > downside of the proposed percentage based metric yet. Please let me know.
>
>
> After reading the FLIP and Average_absolute_deviation in detail, I see that
> 0% is the best and 100% is the worst.
>
>
> I guess it is difficult for users who have not read the documentation to
> know the meaning of 50%. We hope that the designed data skew score will
> be easy for users to understand without reading documentation or learning
> background material.
>
>
> For example, as you mentioned before, flink has a metric:
> numRecordsInPerSecond.
> I believe users know what numRecordsInPerSecond means even if they
> didn't read any documentation.
>
>
> Of course, I'm open to it. I may have missed something. I'd like to
> hear more feedback from the community.
>
>
> Best,
> Rui
>
>
> On Thu, Feb 1, 2024 at 4:13 AM Kartoglu, Emre <kar...@amazon.co.uk.invalid>
> wrote:
>
>
> > Hi Rui,
> >
> > " and provide the total and current score in the detailed tab. I didn't
> > see the detailed design in the FLIP, would you mind
> > improving the design doc? Thanks".
> >
> > It will essentially be a basic list view similar to the "Checkpoints" tab.
> > I only briefly mentioned this in the FLIP because it will be a basic list
> > view.
> > No problem though, I will update the FLIP.
> >
> >
> > Please find my responses below the quotations.
> >
> > " 1. About the current skew score, I still don't understand how to get
> > the list_of_number_of_records_received_by_each_subtask for
> > each subtask.
> >
> > the list_of_number_of_records_received_by_each_subtask of subtask1
> > is
> >
> > total received records of subtask 1 from beginning to now -
> > total received records of subtask 1 from beginning to (now - 1min),
> > right?"
> >
> > Yes, essentially correct. I was thinking about using the existing
> > numRecordsInPerSecond metric (see
> > https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/),
> > which would give us per-second granularity and would be more
> > "current/live" than per minute.
> >
> >
> > "IIUC, your proposed score is between 0% and 100%, and 0% is the best.
> > And 100% is the worst."
> >
> > Correct.
> >
> >
> > " For data skew, I'm not sure whether a multiple value is more intuitive.
> > It means data skew score = max / mean.
> > The data skew score is between 1 and infinity. 1 is the best, and
> > the bigger the worse."
> >
> > I'm not sure I follow you here. Yes, this would always give you a score
> > higher than 1, with no way to cap the score.
> > I'm more used to working with percentages. The problem with the max/mean
> > metric is I wouldn't immediately know whether a score of 300 is bad for
> > instance.
> > Whereas if users saw above 50% as suggested in the FLIP for instance, they
> > would consider taking action. I'm tempted to push back on this suggestion.
> > Happy to discuss further, there is a chance I'm not seeing the downside of
> > the proposed percentage based metric yet. Please let me know.
> >
> > Kind regards,
> > Emre
> >
> > On 31/01/2024, 10:57, "Rui Fan" <1996fan...@gmail.com> wrote:
> >
> > Sorry for the late reply.
> >
> >
> >
> >
> > > So you would have a high data skew while 1 subtask is receiving all the
> > > data, but on average (say over 1-2 days) data skew would come down to 0
> > > because all subtasks would have received their portion of the data.
> > > I'm inclined to think that the current proposal might still be fair, as
> > > you do indeed have a skew by definition (but an intentional one). We can
> > > have a few ways forward:
> > >
> > > 0) We can keep the behaviour as proposed. My thoughts are that data skew
> > > is data skew, however intentional it may be. It is not necessarily bad,
> > > like in your example.
> >
> >
> > It makes sense to me. Flink should show data skew correctly
> > regardless of whether the skew is intentional or not.
> >
> >
> >
> >
> > > 1) Show data skew based on the beginning of time (not a live/current
> > > score).
> > > I mentioned some downsides to this in the FLIP: If you break or fix your
> > > data skew recently, the historical data might hide the recent fix/breakage,
> > > and it is inconsistent with the other metrics shown on the vertices e.g.
> > > Backpressure/Busy metrics show the live/current score.
> > >
> > > 2) We can choose not to put data skew score on the vertices on the job
> > > graph. And instead just use the new proposed Data Skew tab which could show
> > > live/current skew score and the total data skew score from the beginning of
> > > job.
> >
> >
> > It makes sense, we can show the current skew score in the DAG WebUI by
> > default,
> > and provide the total and current score in the detailed tab.
> >
> >
> > I didn't see the detailed design in the FLIP, would you mind
> > improving the design doc? Thanks.
> >
> >
> > Also, I have 2 questions for now:
> >
> >
> > 1. About the current skew score, I still don't understand how to get
> > the list_of_number_of_records_received_by_each_subtask for
> > each subtask.
> >
> >
> > the list_of_number_of_records_received_by_each_subtask of subtask1
> > is total received records of subtask 1 from beginning to now -
> > total received records of subtask 1 from beginning to (now - 1min),
> > right?
> >
> >
> > Note: 1min is an example. 30s or 2min is fine for me.
> >
> >
> > 2. The skew score is percent
> >
> >
> > I'm not sure whether showing the score in percent format is reasonable.
> > For the busy ratio or backpressure ratio, showing them in percent format
> > is intuitive.
> >
> >
> > IIUC, your proposed score is between 0% and 100%, and 0% is the best.
> > And 100% is the worst.
> >
> >
> > For data skew, I'm not sure whether a multiple value is more intuitive.
> > It means data skew score = max / mean.
> >
> >
> > For example, we have 5 subtasks, the received record numbers are
> > [10, 10, 10, 100, 10].
> > data skew score = max / mean = 100 / (140 / 5) = 100 / 28 = 3.57.
> >
> >
> > The data skew score is between 1 and infinity. 1 is the best, and
> > the bigger the worse.
> >
> >
> > Looking forward to your opinions.
> >
> >
> > Best,
> > Rui
> >
> >
> > On Tue, Jan 23, 2024 at 6:41 PM Kartoglu, Emre <kar...@amazon.co.uk.invalid>
> > wrote:
> >
> >
> > > Hi Krzysztof,
> > >
> > > Thank you for the feedback! Please find my comments below.
> > >
> > > 1. Configurability
> > >
> > > Adding a feature flag / configuration to enable this is still on the table
> > > as far as I am concerned. However I believe adding a new metric shouldn't
> > > warrant a flag/configuration. One might argue that we should have it for
> > > showing the metrics on the Flink UI, and I'd appreciate input on this. My
> > > default position is to not have a configuration/flag unless there is a good
> > > reason (e.g. it turns out there is impact on Flink UI for so far unknown
> > > reason). This is because the proposed change should only be improving the
> > > experience without any unwanted side effect.
> > >
> > > 2. Metrics
> > >
> > > I agree the new metrics should be compatible with the rest of the Flink
> > > metric reporting mechanism. I will update the FLIP and propose names for
> > > the metrics.
> > >
> > > Kind regards,
> > > Emre
> > >
> > > On 23/01/2024, 10:31, "Krzysztof Dziołak" <kdzio...@live.com> wrote:
> > >
> > > Hi Emre,
> > >
> > >
> > > Thank you for driving this proposal. I've got two questions about the
> > > extensions to the proposal that are not captured in the FLIP.
> > >
> > >
> > >
> > >
> > > 1. Configurability - what kind of configuration would you propose to
> > > maintain for this feature? Would an on/off switch and/or the aggregation
> > > period length be configurable? Should we capture the toggles in the FLIP?
> > > 2. Metrics - are we planning to emit the skew metric via the metric
> > > reporters mechanism? Should we capture the proposed metric schema in the
> > > FLIP?
> > >
> > >
> > > Kind regards,
> > > Krzysztof
> > >
> > >
> > > ________________________________
> > > From: Kartoglu, Emre <kar...@amazon.co.uk.invalid>
> > > Sent: Monday, January 15, 2024 4:59 PM
> > > To: dev@flink.apache.org
> > > Subject: [DISCUSS] FLIP-418: Show data skew score on Flink Dashboard
> > >
> > >
> > > Hello,
> > >
> > >
> > > I’m opening this thread to discuss a FLIP[1] to make data skew more
> > > visible on Flink Dashboard.
> > >
> > >
> > > Data skew is currently not as visible as it should be. Users have to click
> > > each operator and check how much data each sub-task is processing and
> > > compare the sub-tasks against each other. This is especially cumbersome and
> > > error-prone for jobs with big job graphs and high parallelism. I’m
> > > proposing this FLIP to improve this.
> > >
> > >
> > > Kind regards,
> > > Emre
> > >
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-418%3A+Show+data+skew+score+on+Flink+Dashboard
