Hi Rui,

" and provide the total and current score in the detailed tab. I didn't see the 
detailed design in the FLIP, would you mind
improving the design doc? Thanks".

It will essentially be a basic list view similar to the "Checkpoints" tab, 
which is why I only mentioned it briefly in the FLIP.
No problem though, I will update the FLIP.


Please find my responses below each quotation.

" 1. About the current skew score, I still don't understand how to get
the list_of_number_of_records_received_by_each_subtask for
each subtask.

the list_of_number_of_records_received_by_each_subtask of subtask1
is 

total received records of subtask 1 from beginning to now -
total received records of subtask 1 from beginning to (now - 1min), right?"

Yes, essentially correct. I was thinking of using the existing 
numRecordsInPerSecond metric (see 
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/). This 
would give us per-second granularity, which is more "current/live" than 
per-minute.
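
To make the windowed count from the question concrete, here is a minimal 
sketch. It assumes we have two snapshots of the cumulative records-received 
counter per subtask, taken a fixed interval apart. The function name and the 
sample numbers are hypothetical and are not taken from the FLIP.

```python
def records_in_window(totals_now, totals_earlier):
    """Records each subtask received between two cumulative snapshots."""
    if len(totals_now) != len(totals_earlier):
        raise ValueError("snapshots must cover the same subtasks")
    return [now - earlier for now, earlier in zip(totals_now, totals_earlier)]


# Hypothetical cumulative counts for 3 subtasks, taken 60 seconds apart.
WINDOW_SECONDS = 60
totals_a_minute_ago = [1_000, 1_200, 900]
totals_now = [1_060, 1_800, 960]

window_counts = records_in_window(totals_now, totals_a_minute_ago)
per_second = [count / WINDOW_SECONDS for count in window_counts]

print(window_counts)  # [60, 600, 60]
print(per_second)     # [1.0, 10.0, 1.0]
```

A per-second rate metric would hand us something like the second list 
directly, without having to keep the earlier snapshot around.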


"IIUC, you proposed score is between 0% to 100%, and 0% is the best.
And the 100% is the worst."

Correct.


" For data skew, I'm not sure whether a multiple value is more intuitive.
It means data skew score = max / mean.
 The data skew score is between 1 and infinity. 1 is the best, and
the bigger the worse."

I'm not sure I follow you here. Yes, this would always give you a score of at 
least 1, with no way to cap the score.
I'm more used to working with percentages. The problem with the max/mean metric 
is that I wouldn't immediately know whether a score of, say, 300 is bad.
Whereas if users saw a score above 50%, as suggested in the FLIP, they 
would consider taking action. I'm tempted to push back on this suggestion. 
Happy to discuss further; there is a chance I'm not yet seeing the downside of 
the proposed percentage-based metric. Please let me know.

Kind regards,
Emre

On 31/01/2024, 10:57, "Rui Fan" <1996fan...@gmail.com> wrote:


Sorry for the late reply.




> So you would have a high data skew while 1 subtask is receiving all the
> data, but on average (say over 1-2 days) data skew would come down to 0
> because all subtasks would have received their portion of the data.
> I'm inclined to think that the current proposal might still be fair, as
> you do indeed have a skew by definition (but an intentional one). We can
> have a few ways forward:
>
> 0) We can keep the behaviour as proposed. My thoughts are that data skew
> is data skew, however intentional it may be. It is not necessarily bad,
> like in your example.


That makes sense to me. Flink should show data skew correctly
regardless of whether the skew is intentional or not.




> 1) Show data skew based on the beginning of time (not a live/current score).
> I mentioned some downsides to this in the FLIP: If you break or fix your
> data skew recently, the historical data might hide the recent fix/breakage,
> and it is inconsistent with the other metrics shown on the vertices e.g.
> Backpressure/Busy metrics show the live/current score.
>
> 2) We can choose not to put data skew score on the vertices on the job
> graph. And instead just use the new proposed Data Skew tab which could show
> live/current skew score and the total data skew score from the beginning of
> job.


That makes sense. We can show the current skew score in the DAG WebUI by
default, and provide the total and current scores in the detailed tab.


I didn't see the detailed design in the FLIP. Would you mind
improving the design doc? Thanks.


Also, I have 2 questions for now:


1. About the current skew score, I still don't understand how to get
the list_of_number_of_records_received_by_each_subtask for
each subtask.


the list_of_number_of_records_received_by_each_subtask of subtask1
is total received records of subtask 1 from beginning to now -
total received records of subtask 1 from beginning to (now - 1min), right?


Note: 1min is just an example. 30s or 2min would also be fine with me.


2. The skew score as a percentage


I'm not sure whether showing the score as a percentage is reasonable.
For the busy ratio or backpressure ratio, a percentage format
is intuitive.


IIUC, your proposed score is between 0% and 100%, where 0% is the best
and 100% is the worst.


For data skew, I wonder whether a multiple would be more intuitive.
That is, data skew score = max / mean.


For example, we have 5 subtasks, and the received record numbers are
[10, 10, 10, 100, 10].
data skew score = max / mean = 100 / (140 / 5) = 100 / 28 = 3.57.


The data skew score is between 1 and infinity. 1 is the best, and
the bigger the worse.
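

The example above can be checked with a few lines of code. This is only a 
minimal sketch: the function name is hypothetical, and returning 1.0 when no 
records have been received at all is an assumption made for illustration, not 
something defined in the FLIP.

```python
def max_over_mean(counts):
    """Skew as a multiple: the busiest subtask's count divided by the mean."""
    if not counts:
        raise ValueError("need at least one subtask")
    mean = sum(counts) / len(counts)
    if mean == 0:
        # Assumption: no records anywhere is treated as "no skew".
        return 1.0
    return max(counts) / mean


score = max_over_mean([10, 10, 10, 100, 10])
print(round(score, 2))  # 3.57

balanced = max_over_mean([10, 10, 10, 10, 10])      # 1.0, the best case
one_hot = max_over_mean([0, 0, 0, 0, 100])          # 5.0, all data on one subtask
```

Note that for a fixed number of subtasks and non-negative counts, the ratio 
cannot exceed the number of subtasks, so the worst case grows with the 
parallelism.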


Looking forward to your opinions.


Best,
Rui


On Tue, Jan 23, 2024 at 6:41 PM Kartoglu, Emre <kar...@amazon.co.uk.invalid>
wrote:


> Hi Krzysztof,
>
> Thank you for the feedback! Please find my comments below.
>
> 1. Configurability
>
> Adding a feature flag / configuration to enable this is still on the table
> as far as I am concerned. However, I believe adding a new metric shouldn't
> warrant a flag/configuration. One might argue that we should have one for
> showing the metrics on the Flink UI, and I'd appreciate input on this. My
> default position is to not have a configuration/flag unless there is a good
> reason (e.g. it turns out there is an impact on the Flink UI for an as yet
> unknown reason). This is because the proposed change should only improve
> the experience, without any unwanted side effects.
>
> 2. Metrics
>
> I agree the new metrics should be compatible with the rest of the Flink
> metric reporting mechanism. I will update the FLIP and propose names for
> the metrics.
>
> Kind regards,
> Emre
>
> On 23/01/2024, 10:31, "Krzysztof Dziołak" <kdzio...@live.com> wrote:
>
>
> Hi Emre,
>
>
> Thank you for driving this proposal. I've got two questions about the
> extensions to the proposal that are not captured in the FLIP.
>
>
>
>
> 1. Configurability - what kind of configuration would you propose to
> maintain for this feature? Would an on/off switch and/or the aggregation
> period length be configurable? Should we capture the toggles in the FLIP?
> 2. Metrics - are we planning to emit the skew metric via the metric
> reporters mechanism? Should we capture the proposed metric schema in the
> FLIP?
>
>
> Kind regards,
> Krzysztof
>
>
> ________________________________
> From: Kartoglu, Emre <kar...@amazon.co.uk.invalid>
> Sent: Monday, January 15, 2024 4:59 PM
> To: dev@flink.apache.org <dev@flink.apache.org>
> Subject: [DISCUSS] FLIP-418: Show data skew score on Flink Dashboard
>
>
> Hello,
>
>
> I’m opening this thread to discuss a FLIP[1] to make data skew more
> visible on Flink Dashboard.
>
>
> Data skew is currently not as visible as it should be. Users have to click
> each operator and check how much data each sub-task is processing and
> compare the sub-tasks against each other. This is especially cumbersome and
> error-prone for jobs with big job graphs and high parallelism. I’m
> proposing this FLIP to improve this.
>
>
> Kind regards,
> Emre
>
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-418%3A+Show+data+skew+score+on+Flink+Dashboard