Re: [DISCUSS] FLIP-418: Show data skew score on Flink Dashboard

2024-02-07 Thread Rui Fan
Thanks Emre for the feedback!

I still think max/mean is simpler and easier for users to understand,
but I don’t have a strong opinion about it.

This proposal is absolutely useful for Flink users! To make sure it
delivers value for users, would you mind waiting a while to see whether
there is more feedback from the community? Also, would you mind sharing
these 2 solutions on the user[1] & user-zh[2] mailing lists as well?
Flink users may give some valuable feedback there, thanks~

[1] u...@flink.apache.org
[2] user...@flink.apache.org

Best,
Rui

On Thu, Feb 1, 2024 at 5:52 PM Kartoglu, Emre wrote:

Re: [DISCUSS] FLIP-418: Show data skew score on Flink Dashboard

2024-02-01 Thread Kartoglu, Emre
Hi Rui,

Thanks for the useful feedback and caring about the user experience. 
I will update the FLIP based on 1 comment. I consider this a minor update.

Please find my detailed responses below. 

"numRecordsInPerSecond makes sense to me, and I think it's necessary
to mention it in the FLIP wiki. It will let other developers understand
it easily. WDYT?"

I feel like this might be touching on implementation details. No objections
though; I will update the FLIP with this as one of the ways in which we can
achieve the proposal.


"After reading the FLIP and Average_absolute_deviation in detail, I
understand that 0% is the best and 100% is the worst."

Correct.


"I guess it is difficult for users who have not read the documentation to
know the meaning of 50%. We hope that the data skew score will be easy
for users to understand without reading the documentation or learning a
lot of background."

I think I understand where you're coming from. My thought is that the user
won't have to know exactly how the skew percentage/score is calculated;
the score will act as a warning sign for them. Upon seeing a skew score of
80% for an operator, as a user I will go and click on the operator to see
that many of my subtasks are not receiving any data at all.
So it acts as a metric that draws the user's attention to the skewed
operator so they can fix issues.


"For example, as you mentioned before, Flink has a metric:
numRecordsInPerSecond.
I believe users know what numRecordsInPerSecond means even if they
haven't read any documentation."

The FLIP suggests that we will provide an explanation of the data skew score
under the proposed Data Skew tab. I would like the exact wording to be left
to the code-review process, to prevent it from blocking the implementation
work/progress.
The explanation will be user-friendly, with an option for curious users to
see the exact formula.


Kind regards,
Emre


On 01/02/2024, 03:26, "Rui Fan" <1996fan...@gmail.com> wrote:

Re: [DISCUSS] FLIP-418: Show data skew score on Flink Dashboard

2024-01-31 Thread Rui Fan
> I was thinking about using the existing numRecordsInPerSecond metric

numRecordsInPerSecond makes sense to me, and I think it's necessary
to mention it in the FLIP wiki. It will let other developers understand
it easily. WDYT?

BTW, that's why I asked whether the data skew score means the total
received records.

> this would always give you a score higher than 1, with no way to cap the
score.

Yeah, you are right: max/mean is not a score, it's the data skew multiple.
And I guess max/mean is easier to understand than
Average_absolute_deviation.

> I'm more used to working with percentages. The problem with the max/mean
metric is I wouldn't immediately know whether a score of 300 is bad for
instance.
> Whereas if users saw above 50% as suggested in the FLIP for instance,
they would consider taking action. I'm tempted to push back on this
suggestion. Happy to discuss further, there is a chance I'm not seeing the
downside of the proposed percentage based metric yet. Please let me know.

After reading the FLIP and Average_absolute_deviation in detail, I
understand that 0% is the best and 100% is the worst.

I guess it is difficult for users who have not read the documentation to
know the meaning of 50%. We hope that the data skew score will be easy
for users to understand without reading the documentation or learning a
lot of background.

For example, as you mentioned before, Flink has a metric:
numRecordsInPerSecond.
I believe users know what numRecordsInPerSecond means even if they
haven't read any documentation.

Of course, I'm open to it. I may have missed something. I'd like to hear
more feedback from the community.

Best,
Rui

On Thu, Feb 1, 2024 at 4:13 AM Kartoglu, Emre wrote:

Re: [DISCUSS] FLIP-418: Show data skew score on Flink Dashboard

2024-01-31 Thread Kartoglu, Emre
Hi Rui,

" and provide the total and current score in the detailed tab. I didn't see
the detailed design in the FLIP; would you mind improving the design doc?
Thanks."

It will essentially be a basic list view similar to the "Checkpoints" tab. I 
only briefly mentioned this in the FLIP because it will be a basic list view.
No problem though, I will update the FLIP.


Please find my responses below quotations.

" 1. About the current skew score, I still don't understand how to get
the list_of_number_of_records_received_by_each_subtask for each subtask.

The list_of_number_of_records_received_by_each_subtask of subtask 1 is:
total received records of subtask 1 from the beginning to now, minus
total received records of subtask 1 from the beginning to (now - 1min), right?"

Yes, essentially correct. I was thinking about using the existing
numRecordsInPerSecond metric (see
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/),
which would give us per-second granularity and would be more "current/live"
than per-minute.


"IIUC, your proposed score is between 0% and 100%; 0% is the best and
100% is the worst."

Correct.


" For data skew, I'm not sure whether a multiple value is more intuitive.
It means data skew score = max / mean.
The data skew score is between 1 and infinity; 1 is the best, and the
bigger, the worse."

I'm not sure I follow you here. Yes, this would always give you a score
higher than 1, with no way to cap the score.
I'm more used to working with percentages. The problem with the max/mean
metric is that I wouldn't immediately know whether a score of 300 is bad,
for instance.
Whereas if users saw above 50%, as suggested in the FLIP, they would
consider taking action. I'm tempted to push back on this suggestion.
Happy to discuss further; there is a chance I'm not seeing the downside of
the proposed percentage-based metric yet. Please let me know.

Kind regards,
Emre

On 31/01/2024, 10:57, "Rui Fan" <1996fan...@gmail.com> wrote:

Re: [DISCUSS] FLIP-418: Show data skew score on Flink Dashboard

2024-01-31 Thread Rui Fan
Sorry for the late reply.


> So you would have a high data skew while 1 subtask is receiving all the
data, but on average (say over 1-2 days) data skew would come down to 0
because all subtasks would have received their portion of the data.
> I'm inclined to think that the current proposal might still be fair, as
you do indeed have a skew by definition (but an intentional one). We can
have a few ways forward:
>
> 0) We can keep the behaviour as proposed. My thoughts are that data skew
is data skew, however intentional it may be. It is not necessarily bad,
like in your example.

It makes sense to me. Flink should show data skew correctly
regardless of whether the skew is intentional or not.


> 1) Show data skew based on the beginning of time (not a live/current score).
I mentioned some downsides to this in the FLIP: If you break or fix your
data skew recently, the historical data might hide the recent fix/breakage,
and it is inconsistent with the other metrics shown on the vertices e.g.
Backpressure/Busy metrics show the live/current score.
>
> 2) We can choose not to put data skew score on the vertices on the job
graph. And instead just use the new proposed Data Skew tab which could show
live/current skew score and the total data skew score from the beginning of
job.

It makes sense; we can show the current skew score in the DAG WebUI by
default, and provide both the total and the current score in the detailed
tab.

I didn't see the detailed design in the FLIP; would you mind improving
the design doc? Thanks!

Also, I have 2 questions for now:

1. About the current skew score, I still don't understand how to get
the list_of_number_of_records_received_by_each_subtask for each subtask.

The list_of_number_of_records_received_by_each_subtask of subtask 1 is:
total received records of subtask 1 from the beginning to now, minus
total received records of subtask 1 from the beginning to (now - 1min), right?

Note: 1min is an example. 30s or 2min is fine for me.
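If it helps to pin the idea down, here is a minimal Python sketch of that
windowing scheme (the names are illustrative, not from the FLIP): each
subtask's per-window count is the difference between two snapshots of its
cumulative received-records counter.

```python
from typing import List

def records_in_window(totals_now: List[int], totals_before: List[int]) -> List[int]:
    """Per-subtask records received during the window: the cumulative
    counter now, minus the cumulative counter one window (e.g. 1min) ago."""
    return [now - before for now, before in zip(totals_now, totals_before)]

# Cumulative received-records counters for 3 subtasks,
# snapshotted 1 minute apart.
snapshot_before = [1000, 1000, 1000]
snapshot_now = [1010, 1500, 1005]
print(records_in_window(snapshot_now, snapshot_before))  # -> [10, 500, 5]
```

With per-second counters such as numRecordsInPerSecond, the same idea applies
with a shorter window.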

2. The skew score is a percentage

I'm not sure whether showing the score in percent format is reasonable.
For the busy ratio or backpressure ratio, the percent format is intuitive.

IIUC, your proposed score is between 0% and 100%; 0% is the best and
100% is the worst.

For data skew, I'm not sure whether a multiple value is more intuitive.
It means data skew score = max / mean.

For example, if we have 5 subtasks and the received record numbers are
[10, 10, 10, 100, 10], then
data skew score = max / mean = 100 / (140/5) = 100/28 = 3.57.

The data skew score is between 1 and infinity; 1 is the best, and the
bigger, the worse.
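To make the two candidates concrete, a short Python sketch for the example
above. `skew_multiple` is exactly the max/mean formula from this thread;
`mean_abs_deviation` only computes the Average_absolute_deviation measure
itself — the exact mapping of it onto the proposed 0%–100% score is defined
in the FLIP wiki, so nothing below should be read as the FLIP's final formula.

```python
def skew_multiple(records):
    """The max/mean data-skew multiple discussed in this thread.
    1 is the best; the bigger, the worse; uncapped."""
    mean = sum(records) / len(records)
    return max(records) / mean

def mean_abs_deviation(records):
    """Average absolute deviation from the mean -- the measure the FLIP's
    percentage score is based on (the 0%-100% normalisation is in the FLIP)."""
    mean = sum(records) / len(records)
    return sum(abs(x - mean) for x in records) / len(records)

records = [10, 10, 10, 100, 10]  # 5 subtasks, one hot subtask
print(round(skew_multiple(records), 2))  # -> 3.57
print(mean_abs_deviation(records))       # -> 28.8
```

A perfectly balanced input like [10, 10, 10] gives a multiple of 1.0 and a
deviation of 0, matching the "best" ends of both scales.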

Looking forward to your opinions.

Best,
Rui

On Tue, Jan 23, 2024 at 6:41 PM Kartoglu, Emre wrote:

Re: [DISCUSS] FLIP-418: Show data skew score on Flink Dashboard

2024-01-23 Thread Kartoglu, Emre
Hi Krzysztof,

Thank you for the feedback! Please find my comments below.

1. Configurability

Adding a feature flag/configuration to enable this is still on the table as
far as I am concerned. However, I believe adding a new metric shouldn't
warrant a flag/configuration. One might argue that we should have one for
showing the metrics on the Flink UI, and I'd appreciate input on this. My
default position is not to have a configuration/flag unless there is a good
reason (e.g. it turns out there is an impact on the Flink UI for a so-far
unknown reason), because the proposed change should only improve the
experience, without any unwanted side effects.

2. Metrics

I agree the new metrics should be compatible with the rest of the Flink metric 
reporting mechanism. I will update the FLIP and propose names for the metrics.

Kind regards,
Emre

On 23/01/2024, 10:31, "Krzysztof Dziołak" <kdzio...@live.com> wrote:

Re: [DISCUSS] FLIP-418: Show data skew score on Flink Dashboard

2024-01-23 Thread Krzysztof Dziołak
Hi Emre,

Thank you for driving this proposal. I've got two questions about the 
extensions to the proposal that are not captured in the FLIP.


  1.  Configurability - what kind of configuration would you propose to
maintain for this feature? Would an on/off switch and/or the aggregation
period length be configurable? Should we capture the toggles in the FLIP?
  2.  Metrics - are we planning to emit the skew metric via the metric
reporter mechanism? Should we capture the proposed metric schema in the FLIP?

Kind regards,
Krzysztof


From: Kartoglu, Emre 
Sent: Monday, January 15, 2024 4:59 PM
To: dev@flink.apache.org 
Subject: [DISCUSS] FLIP-418: Show data skew score on Flink Dashboard

Hello,

I’m opening this thread to discuss a FLIP[1] to make data skew more visible on 
Flink Dashboard.

Data skew is currently not as visible as it should be. Users have to click each 
operator and check how much data each sub-task is processing and compare the 
sub-tasks against each other. This is especially cumbersome and error-prone for 
jobs with big job graphs and high parallelism. I’m proposing this FLIP to 
improve this.

Kind regards,
Emre

[1] 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-418%3A+Show+data+skew+score+on+Flink+Dashboard





Re: Re:[DISCUSS] FLIP-418: Show data skew score on Flink Dashboard

2024-01-16 Thread Kartoglu, Emre
Hi Xuyang,

Thanks for the feedback! Please find my response below.

> 1. How will the colors of vertices with high data skew scores be unified
> with the existing backpressure and high-busyness colors on the UI? Users
> should be able to distinguish at a glance which vertices in the entire job
> graph are skewed.

The current proposal does not suggest changing the colours of the vertices
based on data skew. In another exchange with Rui, we touched on why data skew
might not necessarily be bad (for instance, if data skew is the designed
behaviour). The colours are currently dedicated to the Busy/Backpressure
metrics. I would not be keen on introducing another colour, or on using the
same colours for data skew, as I am not sure whether that would help or
confuse users. I am also keen to keep the scope of this FLIP as minimal as
possible, with as few contentious points as possible. We could also revisit
this point in future FLIPs, if it does not become a blocker for this one.
Please let me know your thoughts.

> 2. Can you tell me whether you prefer to unify the Data Skew Score and the
> Exception tab? In my opinion, the Data Skew Score is in the same category
> as the existing Backpressured and Busy metrics.

The FLIP does not propose to unify the Data Skew tab and the Exception tab.
The proposed Data Skew tab would sit next to the Exception tab (but I'm not
too opinionated on where it sits). The Backpressure and Busy metrics are
somewhat special in that they have high visibility, thanks to the vertices
changing colour based on their value. I agree that data skew is in the same
category, in that it can be used as an indicator of the job's health. I'm not
sure whether the suggestion here is to not introduce a tab for data skew?
I'd appreciate some clarification.

Look forward to hearing your thoughts.

Emre


On 16/01/2024, 06:05, "Xuyang" <xyzhong...@163.com> wrote:

Hi, Emre.




In large-scale production jobs, data skew often occurs. Having a metric on
the UI that reflects data skew, without the need to manually inspect each
vertex by clicking on it, would be quite cool. This could help users quickly
identify problematic nodes, simplifying development and operations.




I'm mainly curious about two minor points:
1. How will the colors of vertices with high data skew scores be unified
with the existing backpressure and high-busyness colors on the UI? Users
should be able to distinguish at a glance which vertices in the entire job
graph are skewed.
2. Can you tell me whether you prefer to unify the Data Skew Score and the
Exception tab? In my opinion, the Data Skew Score is in the same category
as the existing Backpressured and Busy metrics.




Looking forward to your reply.






--


Best!
Xuyang










At 2024-01-16 00:59:57, "Kartoglu, Emre" <kar...@amazon.co.uk.invalid> wrote:
>Hello,
>
>I’m opening this thread to discuss a FLIP[1] to make data skew more visible on 
>Flink Dashboard.
>
>Data skew is currently not as visible as it should be. Users have to click 
>each operator and check how much data each sub-task is processing and compare 
>the sub-tasks against each other. This is especially cumbersome and 
>error-prone for jobs with big job graphs and high parallelism. I’m proposing 
>this FLIP to improve this.
>
>Kind regards,
>Emre
>
>[1] 
>https://cwiki.apache.org/confluence/display/FLINK/FLIP-418%3A+Show+data+skew+score+on+Flink+Dashboard
> 
>
>
>
>





Re: [DISCUSS] FLIP-418: Show data skew score on Flink Dashboard

2024-01-16 Thread Kartoglu, Emre
Hi Rui,

Thanks for the feedback. Please find my response below:

> The number_of_records_received_by_each_subtask is the total received records, 
> right?

No, it's not the total. I understand why this is confusing. I had initially 
wanted to name it "the list of numbers of records received by each subtask", so 
its type is a list. Example: [10, 10, 10] => 3 sub-tasks, each of which received 
10 records. 
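To make the list interpretation concrete, here is a small Python sketch of one plausible way such a per-subtask list could be turned into a skew score. It assumes an average-absolute-deviation formulation (as discussed around the FLIP); the function name and the cap at 100% are illustrative assumptions, not the FLIP's final definition:

```python
from statistics import mean

def skew_score(records_per_subtask):
    """Hypothetical skew score: average absolute deviation of the
    per-subtask record counts, normalised by the mean and capped at 100%.
    0% = perfectly balanced, 100% = heavily skewed."""
    m = mean(records_per_subtask)
    if m == 0:
        return 0.0  # no records received yet: treat as no skew
    aad = mean(abs(x - m) for x in records_per_subtask)
    return min(aad / m * 100, 100.0)

print(skew_score([10, 10, 10]))  # 3 balanced subtasks -> 0.0
print(skew_score([30, 0, 0]))    # one hot subtask -> 100.0
```

With this shape, a balanced job reads 0% and a job where one subtask receives all the data saturates at 100%, matching the "0% is best, 100% is worst" reading discussed earlier in the thread.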

In your example, you have subtasks with each one designed to receive records at 
different times of the day. I hadn't thought about this use case! 
So you would have a high data skew while 1 subtask is receiving all the data, 
but on average (say over 1-2 days) data skew would come down to 0 because all 
subtasks would have received their portion of the data.
I'm inclined to think that the current proposal might still be fair, as you do 
indeed have a skew by definition (but an intentional one). We can have a few 
ways forward:

0) We can keep the behaviour as proposed. My thoughts are that data skew is 
data skew, however intentional it may be. It is not necessarily bad, like in 
your example.

1) Show data skew computed from the beginning of time (not a live/current 
score). I mentioned some downsides to this in the FLIP: if you recently broke 
or fixed your data skew, the historical data might hide the recent 
breakage/fix, and it is inconsistent with the other metrics shown on the 
vertices, e.g. the Backpressure/Busy metrics show a live/current score.

2) We can choose not to put the data skew score on the vertices in the job 
graph, and instead just use the newly proposed Data Skew tab, which could show 
the live/current skew score and the total data skew score from the beginning of 
the job.
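Neither score is pinned down by the FLIP yet, but the divergence between a from-the-beginning score and a live per-window score on Rui's hourly keyBy example can be sketched in a few lines of Python (the average-absolute-deviation formula capped at 100% is an assumption for illustration only):

```python
from statistics import mean

def skew_score(counts):
    """Assumed average-absolute-deviation skew score, capped at 100%."""
    m = mean(counts)
    if m == 0:
        return 0.0
    return min(mean(abs(c - m) for c in counts) / m * 100, 100.0)

# Rui's example: a keyBy on an hour field rotates the load, so each
# hour a different subtask receives all of the records.
hourly_counts = [
    [1000, 0, 0],  # hour 0-1: subtask A busy
    [0, 1000, 0],  # hour 1-2: subtask B busy
    [0, 0, 1000],  # hour 2-3: subtask C busy
]

# From the beginning of time: per-subtask totals are balanced,
# so the cumulative score hides the rotating skew entirely.
totals = [sum(col) for col in zip(*hourly_counts)]
print(totals, skew_score(totals))  # [1000, 1000, 1000] 0.0

# Live/current score over the latest window only: every individual
# window is maximally skewed.
print(skew_score(hourly_counts[-1]))  # 100.0
```

This is why the two numbers can legitimately disagree: the cumulative score reports 0% while every single window reports 100%.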

Keen to hear your thoughts.

Kind regards,
Emre


On 16/01/2024, 06:44, "Rui Fan" <1996fan...@gmail.com> wrote:


Thanks Emre for driving this proposal!

It's very useful for troubleshooting.

I have a question:

The number_of_records_received_by_each_subtask is the
total received records, right?

I'm not sure whether we should check data skew based on
the latest duration period.

In production, I found that the total received records of
all subtasks are balanced, but in each time period they
are skewed.

For example, a Flink job has a `group by` or `keyBy` on an
hour field. It means:
- In the 0-1 o'clock hour, subtask A is busy and the rest of the subtasks are idle.
- In the 1-2 o'clock hour, subtask B is busy and the rest of the subtasks are idle.
- Each following hour, the busy subtask changes.

Looking forward to your opinions~

Best,
Rui


On Tue, Jan 16, 2024 at 2:05 PM Xuyang <xyzhong...@163.com> wrote:


> Hi, Emre.
>
>
> In large-scale production jobs, the phenomenon of data skew often occurs.
> Having a metric on the UI that
> reflects data skew without the need for manual inspection of each vertex
> by clicking on them would be quite cool.
> This could help users quickly identify problematic nodes, simplifying
> development and operations.
>
>
> I'm mainly curious about two minor points:
> 1. How will the colors of vertices with high data skew scores be unified
> with the existing backpressure and high busyness
> colors on the UI? Users should be able to distinguish at a glance which
> vertices in the entire job graph are skewed.
> 2. Can you tell me whether you prefer to unify the Data Skew Score and the
> Exception tab? In my opinion, the Data Skew Score is in
> the same category as the existing Backpressured and Busy metrics.
>
>
> Looking forward to your reply.
>
>
>
> --
>
> Best!
> Xuyang
>
>
>
>
>
> At 2024-01-16 00:59:57, "Kartoglu, Emre"
> wrote:
> >Hello,
> >
> >I’m opening this thread to discuss a FLIP[1] to make data skew more
> visible on Flink Dashboard.
> >
> >Data skew is currently not as visible as it should be. Users have to
> click each operator and check how much data each sub-task is processing and
> compare the sub-tasks against each other. This is especially cumbersome and
> error-prone for jobs with big job graphs and high parallelism. I’m
> proposing this FLIP to improve this.
> >
> >Kind regards,
> >Emre
> >
> >[1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-418%3A+Show+data+skew+score+on+Flink+Dashboard
>  
> 
> >
> >
> >
>





Re: [DISCUSS] FLIP-418: Show data skew score on Flink Dashboard

2024-01-15 Thread Rui Fan
Thanks Emre for driving this proposal!

It's very useful for troubleshooting.

I have a question:

The number_of_records_received_by_each_subtask is the
total received records, right?

I'm not sure whether we should check data skew based on
the latest duration period.

In production, I found that the total received records of
all subtasks are balanced, but in each time period they
are skewed.

For example, a Flink job has a `group by` or `keyBy` on an
hour field. It means:
- In the 0-1 o'clock hour, subtask A is busy and the rest of the subtasks are idle.
- In the 1-2 o'clock hour, subtask B is busy and the rest of the subtasks are idle.
- Each following hour, the busy subtask changes.

Looking forward to your opinions~

Best,
Rui

On Tue, Jan 16, 2024 at 2:05 PM Xuyang  wrote:

> Hi, Emre.
>
>
> In large-scale production jobs, the phenomenon of data skew often occurs.
> Having a metric on the UI that
> reflects data skew without the need for manual inspection of each vertex
> by clicking on them would be quite cool.
> This could help users quickly identify problematic nodes, simplifying
> development and operations.
>
>
> I'm mainly curious about two minor points:
> 1. How will the colors of vertices with high data skew scores be unified
> with the existing backpressure and high busyness
> colors on the UI? Users should be able to distinguish at a glance which
> vertices in the entire job graph are skewed.
> 2. Can you tell me whether you prefer to unify the Data Skew Score and the
> Exception tab? In my opinion, the Data Skew Score is in
> the same category as the existing Backpressured and Busy metrics.
>
>
> Looking forward to your reply.
>
>
>
> --
>
> Best!
> Xuyang
>
>
>
>
>
> At 2024-01-16 00:59:57, "Kartoglu, Emre" 
> wrote:
> >Hello,
> >
> >I’m opening this thread to discuss a FLIP[1] to make data skew more
> visible on Flink Dashboard.
> >
> >Data skew is currently not as visible as it should be. Users have to
> click each operator and check how much data each sub-task is processing and
> compare the sub-tasks against each other. This is especially cumbersome and
> error-prone for jobs with big job graphs and high parallelism. I’m
> proposing this FLIP to improve this.
> >
> >Kind regards,
> >Emre
> >
> >[1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-418%3A+Show+data+skew+score+on+Flink+Dashboard
> >
> >
> >
>


Re:[DISCUSS] FLIP-418: Show data skew score on Flink Dashboard

2024-01-15 Thread Xuyang
Hi, Emre.


In large-scale production jobs, the phenomenon of data skew often occurs. 
Having a metric on the UI that
reflects data skew without the need for manual inspection of each vertex by 
clicking on them would be quite cool.
This could help users quickly identify problematic nodes, simplifying 
development and operations.


I'm mainly curious about two minor points:
1. How will the colors of vertices with high data skew scores be unified with 
the existing backpressure and high busyness
colors on the UI? Users should be able to distinguish at a glance which vertices 
in the entire job graph are skewed.
2. Can you tell me whether you prefer to unify the Data Skew Score and the 
Exception tab? In my opinion, the Data Skew Score is in
the same category as the existing Backpressured and Busy metrics.


Looking forward to your reply.



--

Best!
Xuyang





At 2024-01-16 00:59:57, "Kartoglu, Emre"  wrote:
>Hello,
>
>I’m opening this thread to discuss a FLIP[1] to make data skew more visible on 
>Flink Dashboard.
>
>Data skew is currently not as visible as it should be. Users have to click 
>each operator and check how much data each sub-task is processing and compare 
>the sub-tasks against each other. This is especially cumbersome and 
>error-prone for jobs with big job graphs and high parallelism. I’m proposing 
>this FLIP to improve this.
>
>Kind regards,
>Emre
>
>[1] 
>https://cwiki.apache.org/confluence/display/FLINK/FLIP-418%3A+Show+data+skew+score+on+Flink+Dashboard
>
>
>