> I was thinking about using the existing numRecordsInPerSecond metric

numRecordsInPerSecond makes sense to me, and I think it's necessary to mention it in the FLIP wiki. It will help other developers understand the design easily. WDYT?
BTW, that's why I asked whether the data skew score is computed from the total received records.

> this would always give you a score higher than 1, with no way to cap the score.

Yeah, you are right. max/mean is not a score, it's the data skew multiple. And I guess max/mean is easier to understand than Average_absolute_deviation.

> I'm more used to working with percentages. The problem with the max/mean metric is I wouldn't immediately know whether a score of 300 is bad for instance.
> Whereas if users saw above 50% as suggested in the FLIP for instance, they would consider taking action. I'm tempted to push back on this suggestion. Happy to discuss further, there is a chance I'm not seeing the downside of the proposed percentage based metric yet. Please let me know.

After reading the FLIP and Average_absolute_deviation in detail, I understand that 0% is the best and 100% is the worst. However, I guess it is difficult for users who have not read the documentation to know what 50% means.

We hope the data skew metric will be easy for users to understand without reading documentation or learning background knowledge. For example, as you mentioned before, Flink has a metric: numRecordsInPerSecond. I believe users know what numRecordsInPerSecond means even if they haven't read any documentation.

Of course, I'm open to it. I may have missed something. I'd like to hear more feedback from the community.

Best,
Rui

On Thu, Feb 1, 2024 at 4:13 AM Kartoglu, Emre <kar...@amazon.co.uk.invalid> wrote:

> Hi Rui,
>
> " and provide the total and current score in the detailed tab. I didn't
> see the detailed design in the FLIP, would you mind
> improve the design doc? Thanks".
>
> It will essentially be a basic list view similar to the "Checkpoints" tab.
> I only briefly mentioned this in the FLIP because it will be a basic list
> view.
> No problem though, I will update the FLIP.
>
> Please find my responses below quotations.
>
> " 1. About the current skew score, I still don't understand how to get
> the list_of_number_of_records_received_by_each_subtask for
> each subtask.
>
> the list_of_number_of_records_received_by_each_subtask of subtask1
> is
>
> total received records of subtask 1 from beginning to now -
> total received records of subtask 1 from beginning to (now - 1min), right?"
>
> Yes, essentially correct. I was thinking about using the existing
> numRecordsInPerSecond metric (see
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/),
> this would give us per second granularity and this would be more
> "current/live" than per minute.
>
> "IIUC, you proposed score is between 0% to 100%, and 0% is the best.
> And the 100% is the worst."
>
> Correct.
>
> " For data skew, I'm not sure whether a multiple value is more intuitive.
> It means data skew score = max / mean.
> The data skew score is between 1 and infinity. 1 is the best, and
> the bigger the worse."
>
> I'm not sure I follow you here. Yes, this would always give you a score
> higher than 1, with no way to cap the score.
> I'm more used to working with percentages. The problem with the max/mean
> metric is I wouldn't immediately know whether a score of 300 is bad for
> instance.
> Whereas if users saw above 50% as suggested in the FLIP for instance, they
> would consider taking action. I'm tempted to push back on this suggestion.
> Happy to discuss further, there is a chance I'm not seeing the downside of
> the proposed percentage based metric yet. Please let me know.
>
> Kind regards,
> Emre
>
> On 31/01/2024, 10:57, "Rui Fan" <1996fan...@gmail.com> wrote:
>
> CAUTION: This email originated from outside of the organization. Do not
> click links or open attachments unless you can confirm the sender and know
> the content is safe.
>
> Sorry for the late reply.
>
> > So you would have a high data skew while 1 subtask is receiving all the
> > data, but on average (say over 1-2 days) data skew would come down to 0
> > because all subtasks would have received their portion of the data.
> > I'm inclined to think that the current proposal might still be fair, as
> > you do indeed have a skew by definition (but an intentional one). We can
> > have a few ways forward:
> >
> > 0) We can keep the behaviour as proposed. My thoughts are that data skew
> > is data skew, however intentional it may be. It is not necessarily bad,
> > like in your example.
>
> It makes sense to me. Flink should show data skew correctly
> regardless of whether the user is intentional or not.
>
> > 1) Show data skew based on the beginning of time (not a live/current
> > score).
> > I mentioned some downsides to this in the FLIP: If you break or fix your
> > data skew recently, the historical data might hide the recent fix/breakage,
> > and it is inconsistent with the other metrics shown on the vertices e.g.
> > Backpressure/Busy metrics show the live/current score.
> >
> > 2) We can choose not to put data skew score on the vertices on the job
> > graph. And instead just use the new proposed Data Skew tab which could show
> > live/current skew score and the total data skew score from the beginning of
> > job.
>
> It makes sense, we can show the current skew score in the DAG WebUI by
> default,
> and provide the total and current score in the detailed tab.
>
> I didn't see the detailed design in the FLIP, would you mind
> improve the design doc? Thanks
>
> Also, I have 2 questions for now:
>
> 1. About the current skew score, I still don't understand how to get
> the list_of_number_of_records_received_by_each_subtask for
> each subtask.
>
> the list_of_number_of_records_received_by_each_subtask of subtask1
> is total received records of subtask 1 from beginning to now -
> total received records of subtask 1 from beginning to (now - 1min), right?
>
> Note: 1min is an example. 30s or 2min is fine for me.
>
> 2. The skew score is percent
>
> I'm not sure whether the score shown in percent format is reasonable.
> For busy ratio or backpressure ratio, they are shown in percent format
> is intuitive.
>
> IIUC, you proposed score is between 0% to 100%, and 0% is the best.
> And the 100% is the worst.
>
> For data skew, I'm not sure whether a multiple value is more intuitive.
> It means data skew score = max / mean.
>
> For example, we have 5 subtasks, the received record numbers are
> [10, 10, 10, 100, 10].
> data skew score = max / mean = 100 / (140/5) = 100 / 28 = 3.57.
>
> The data skew score is between 1 and infinity. 1 is the best, and
> the bigger the worse.
>
> Looking forward to your opinions.
>
> Best,
> Rui
>
> On Tue, Jan 23, 2024 at 6:41 PM Kartoglu, Emre <kar...@amazon.co.uk.invalid>
> wrote:
>
> > Hi Krzysztof,
> >
> > Thank you for the feedback! Please find my comments below.
> >
> > 1. Configurability
> >
> > Adding a feature flag / configuration to enable this is still on the table
> > as far as I am concerned. However I believe adding a new metric shouldn't
> > warrant a flag/configuration. One might argue that we should have it for
> > showing the metrics on the Flink UI, and I'd appreciate input on this. My
> > default position is to not have a configuration/flag unless there is a good
> > reason (e.g. it turns out there is impact on Flink UI for so far unknown
> > reason). This is because the proposed change should only be improving the
> > experience without any unwanted side effect.
> >
> > 2. Metrics
> >
> > I agree the new metrics should be compatible with the rest of the Flink
> > metric reporting mechanism. I will update the FLIP and propose names for
> > the metrics.
> >
> > Kind regards,
> > Emre
> >
> > On 23/01/2024, 10:31, "Krzysztof Dziołak" <kdzio...@live.com> wrote:
> >
> > Hi Emre,
> >
> > Thank you for driving this proposal. I've got two questions about the
> > extensions to the proposal that are not captured in the FLIP.
> >
> > 1. Configurability - what kind of configuration would you propose to
> > maintain for this feature? Would On/off switch and/or aggregated period
> > length be configurable? Should we capture the toggles in the FLIP ?
> > 2. Metrics - are we planning to emit the skew metric via metric reporters
> > mechanism. Should we capture proposed metric schema in the FLIP ?
> >
> > Kind regards,
> > Krzysztof
> >
> > ________________________________
> > From: Kartoglu, Emre <kar...@amazon.co.uk.invalid>
> > Sent: Monday, January 15, 2024 4:59 PM
> > To: dev@flink.apache.org <dev@flink.apache.org>
> > Subject: [DISCUSS] FLIP-418: Show data skew score on Flink Dashboard
> >
> > Hello,
> >
> > I’m opening this thread to discuss a FLIP[1] to make data skew more
> > visible on Flink Dashboard.
> >
> > Data skew is currently not as visible as it should be. Users have to click
> > each operator and check how much data each sub-task is processing and
> > compare the sub-tasks against each other.
> > This is especially cumbersome and
> > error-prone for jobs with big job graphs and high parallelism. I’m
> > proposing this FLIP to improve this.
> >
> > Kind regards,
> > Emre
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-418%3A+Show+data+skew+score+on+Flink+Dashboard
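
To make the quantities discussed in this thread concrete, here is a minimal Python sketch. It is not taken from the FLIP or from Flink: the function names (skew_multiple, average_absolute_deviation, window_counts) are hypothetical, and treating an operator with no traffic as balanced is an assumption. The average absolute deviation is the plain statistic only; how the FLIP turns it into a 0%-100% score is not reproduced here.

```python
def window_counts(total_now, total_before):
    """Records each subtask received in the window.

    Both arguments are cumulative per-subtask counters: one sampled now,
    one sampled at the start of the window (e.g. now - 1min).
    """
    return [now - before for now, before in zip(total_now, total_before)]


def skew_multiple(records_per_subtask):
    """max / mean: 1.0 is perfectly balanced, larger is worse, no upper cap."""
    avg = sum(records_per_subtask) / len(records_per_subtask)
    if avg == 0:
        return 1.0  # assumption: no records at all is treated as balanced
    return max(records_per_subtask) / avg


def average_absolute_deviation(records_per_subtask):
    """Mean of |x - mean| over all subtasks (the raw statistic, not a percentage)."""
    avg = sum(records_per_subtask) / len(records_per_subtask)
    return sum(abs(x - avg) for x in records_per_subtask) / len(records_per_subtask)


# Example from the thread: 5 subtasks, one of them receiving 100 records.
counts = [10, 10, 10, 100, 10]
print(round(skew_multiple(counts), 2))       # 3.57
print(average_absolute_deviation(counts))    # 28.8
```

For [10, 10, 10, 100, 10] the multiple is about 3.57, matching the arithmetic in Rui's mail. Passing the output of window_counts instead of lifetime totals gives the "current" variant of either number.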