Thanks, Emre, for the feedback! I still think max/mean is simpler and easier for users to understand, but I don’t have a strong opinion about it.
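To make the comparison concrete, here is a rough Python sketch of both candidates, using the example from this thread, [10, 10, 10, 100, 10]. `skew_multiple` is the max/mean value discussed here. `skew_percent` is only an assumed way of turning the average absolute deviation into a 0%-100% score (dividing by the deviation you would get if one subtask received everything); it is not necessarily the formula in the FLIP. Both function names are hypothetical.

```python
def skew_multiple(records):
    """max / mean over per-subtask record counts; 1.0 means perfectly balanced."""
    avg = sum(records) / len(records)
    if avg == 0:
        return 1.0  # assumption: an operator that received nothing counts as balanced
    return max(records) / avg


def skew_percent(records):
    """Average absolute deviation scaled to 0-100 (assumed normalisation, not the FLIP's)."""
    n = len(records)
    avg = sum(records) / n
    if n < 2 or avg == 0:
        return 0.0
    # Average absolute deviation around the mean.
    aad = sum(abs(r - avg) for r in records) / n
    # Largest possible deviation: one subtask receives every record.
    worst = 2 * avg * (n - 1) / n
    return 100.0 * aad / worst


if __name__ == "__main__":
    received = [10, 10, 10, 100, 10]
    print(f"max/mean multiple: {skew_multiple(received):.2f}")
    print(f"percentage score:  {skew_percent(received):.1f}%")
```

For the example above the multiple comes out at 3.57, matching the calculation quoted further down, while the assumed percentage score is roughly 64%. The multiple is unbounded, the percentage is capped at 100%, which is the trade-off being debated.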
This proposal is definitely useful for Flink users! To make sure it delivers the most value, would you mind if we wait a while and see whether more feedback comes in from the community? Also, would you mind sharing these 2 solutions with the user[1] and user-zh[2] mailing lists as well? Flink users may give some valuable feedback there, thanks~

[1] u...@flink.apache.org
[2] user...@flink.apache.org

Best,
Rui

On Thu, Feb 1, 2024 at 5:52 PM Kartoglu, Emre <kar...@amazon.co.uk.invalid> wrote:

> Hi Rui,
>
> Thanks for the useful feedback and caring about the user experience.
> I will update the FLIP based on 1 comment. I consider this a minor update.
>
> Please find my detailed responses below.
>
> "numRecordsInPerSecond sounds make sense to me, and I think it's necessary to mention it in the FLIP wiki. It will let other developers to easily understand. WDYT?"
>
> I feel like this might be touching implementation details. No objections though, I will update the FLIP with this as one of the ways in which we can achieve the proposal.
>
> "After I detailed read the FLIP and Average_absolute_deviation, we know 0% is the best, 100% is worst."
>
> Correct.
>
> "I guess it is difficult for users who have not read the documentation to know the meaning of 50%. We hope that the designed Data skew will be easy for users to understand without reading or learning a series of backgrounds."
>
> I think I understand where you're coming from. My thought is that the user won't have to know exactly how the skew percentage/score is calculated. But this score will act as a warning sign for them. Upon seeing a skew score of 80% for an operator, as a user I will go and click on the operator to see many of my subtasks are not receiving any data at all. So it acts as a metric to get the user's attention to the skewed operator and fix issues.
>
> "For example, as you mentioned before, flink has a metric: numRecordsInPerSecond.
> I believe users know what numRecordsInPerSecond means even if they didn't read any documentation."
>
> The FLIP suggests that we will provide an explanation of the data skew score under the proposed Data Skew tab. I would like the exact wording to be left to the code review process to prevent these from blocking the implementation work/progress. This will be a user-friendly explanation with an option for the curious user to see the exact formula.
>
> Kind regards,
> Emre
>
> On 01/02/2024, 03:26, "Rui Fan" <1996fan...@gmail.com> wrote:
>
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>
> > I was thinking about using the existing numRecordsInPerSecond metric
>
> numRecordsInPerSecond sounds make sense to me, and I think it's necessary to mention it in the FLIP wiki. It will let other developers to easily understand. WDYT?
>
> BTW, that's why I ask whether the data skew score means total receive records.
>
> > this would always give you a score higher than 1, with no way to cap the score.
>
> Yeah, you are right. max/mean is not a score, it's the data skew multiple. And I guess max/mean is easier to understand than Average_absolute_deviation.
>
> > I'm more used to working with percentages. The problem with the max/mean metric is I wouldn't immediately know whether a score of 300 is bad for instance.
> > Whereas if users saw above 50% as suggested in the FLIP for instance, they would consider taking action. I'm tempted to push back on this suggestion. Happy to discuss further, there is a chance I'm not seeing the downside of the proposed percentage based metric yet. Please let me know.
>
> After I detailed read the FLIP and Average_absolute_deviation, we know 0% is the best, 100% is worst.
>
> I guess it is difficult for users who have not read the documentation to know the meaning of 50%. We hope that the designed Data skew will be easy for users to understand without reading or learning a series of backgrounds.
>
> For example, as you mentioned before, flink has a metric: numRecordsInPerSecond.
> I believe users know what numRecordsInPerSecond means even if they didn't read any documentation.
>
> Of course, I'm opening for it. I may have missed something. I'd like to hear more feedback from the community.
>
> Best,
> Rui
>
> On Thu, Feb 1, 2024 at 4:13 AM Kartoglu, Emre <kar...@amazon.co.uk.invalid> wrote:
>
> > Hi Rui,
> >
> > " and provide the total and current score in the detailed tab. I didn't see the detailed design in the FLIP, would you mind improve the design doc? Thanks".
> >
> > It will essentially be a basic list view similar to the "Checkpoints" tab. I only briefly mentioned this in the FLIP because it will be a basic list view.
> > No problem though, I will update the FLIP.
> >
> > Please find my responses below quotations.
> >
> > " 1. About the current skew score, I still don't understand how to get the list_of_number_of_records_received_by_each_subtask for each subtask.
> >
> > the list_of_number_of_records_received_by_each_subtask of subtask1 is
> >
> > total received records of subtask 1 from beginning to now - total received records of subtask 1 from beginning to (now - 1min), right?"
> >
> > Yes, essentially correct. I was thinking about using the existing numRecordsInPerSecond metric (see https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/), this would give us per second granularity and this would be more "current/live" than per minute.
> >
> > "IIUC, you proposed score is between 0% to 100%, and 0% is the best.
> > And the 100% is the worst."
> >
> > Correct.
> >
> > " For data skew, I'm not sure whether a multiple value is more intuitive. It means data skew score = max / mean.
> > The data skew score is between 1 and infinity. 1 is the best, and the bigger the worse."
> >
> > I'm not sure I follow you here. Yes, this would always give you a score higher than 1, with no way to cap the score.
> > I'm more used to working with percentages. The problem with the max/mean metric is I wouldn't immediately know whether a score of 300 is bad for instance.
> > Whereas if users saw above 50% as suggested in the FLIP for instance, they would consider taking action. I'm tempted to push back on this suggestion.
> > Happy to discuss further, there is a chance I'm not seeing the downside of the proposed percentage based metric yet. Please let me know.
> >
> > Kind regards,
> > Emre
> >
> > On 31/01/2024, 10:57, "Rui Fan" <1996fan...@gmail.com> wrote:
> >
> > Sorry for the late reply.
> >
> > > So you would have a high data skew while 1 subtask is receiving all the data, but on average (say over 1-2 days) data skew would come down to 0 because all subtasks would have received their portion of the data.
> > > I'm inclined to think that the current proposal might still be fair, as you do indeed have a skew by definition (but an intentional one). We can have a few ways forward:
> > >
> > > 0) We can keep the behaviour as proposed. My thoughts are that data skew is data skew, however intentional it may be. It is not necessarily bad, like in your example.
> >
> > It makes sense to me.
> > Flink should show data skew correctly regardless of whether the user is intentional or not.
> >
> > > 1) Show data skew based on the beginning of time (not a live/current score).
> > > I mentioned some downsides to this in the FLIP: If you break or fix your data skew recently, the historical data might hide the recent fix/breakage, and it is inconsistent with the other metrics shown on the vertices e.g. Backpressure/Busy metrics show the live/current score.
> > >
> > > 2) We can choose not to put data skew score on the vertices on the job graph. And instead just use the new proposed Data Skew tab which could show live/current skew score and the total data skew score from the beginning of job.
> >
> > It makes sense, we can show the current skew score in the DAG WebUI by default, and provide the total and current score in the detailed tab.
> >
> > I didn't see the detailed design in the FLIP, would you mind improve the design doc? Thanks
> >
> > Also, I have 2 questions for now:
> >
> > 1. About the current skew score, I still don't understand how to get the list_of_number_of_records_received_by_each_subtask for each subtask.
> >
> > the list_of_number_of_records_received_by_each_subtask of subtask1 is total received records of subtask 1 from beginning to now - total received records of subtask 1 from beginning to (now - 1min), right?
> >
> > Note: 1min is an example. 30s or 2min is fine for me.
> >
> > 2. The skew score is percent
> >
> > I'm not sure whether the score shown in percent format is reasonable. For busy ratio or backpressure ratio, they are shown in percent format is intuitive.
> >
> > IIUC, you proposed score is between 0% to 100%, and 0% is the best. And the 100% is the worst.
> >
> > For data skew, I'm not sure whether a multiple value is more intuitive. It means data skew score = max / mean.
> >
> > For example, we have 5 subtasks, the received record numbers are [10, 10, 10, 100, 10].
> > data skew score = max / mean = 100 / (140/5) = 100 / 28 = 3.57.
> >
> > The data skew score is between 1 and infinity. 1 is the best, and the bigger the worse.
> >
> > Looking forward to your opinions.
> >
> > Best,
> > Rui
> >
> > On Tue, Jan 23, 2024 at 6:41 PM Kartoglu, Emre <kar...@amazon.co.uk.invalid> wrote:
> >
> > > Hi Krzysztof,
> > >
> > > Thank you for the feedback! Please find my comments below.
> > >
> > > 1. Configurability
> > >
> > > Adding a feature flag / configuration to enable this is still on the table as far as I am concerned. However I believe adding a new metric shouldn't warrant a flag/configuration. One might argue that we should have it for showing the metrics on the Flink UI, and I'd appreciate input on this. My default position is to not have a configuration/flag unless there is a good reason (e.g. it turns out there is impact on Flink UI for so far unknown reason). This is because the proposed change should only be improving the experience without any unwanted side effect.
> > >
> > > 2. Metrics
> > >
> > > I agree the new metrics should be compatible with the rest of the Flink metric reporting mechanism. I will update the FLIP and propose names for the metrics.
> > >
> > > Kind regards,
> > > Emre
> > >
> > > On 23/01/2024, 10:31, "Krzysztof Dziołak" <kdzio...@live.com> wrote:
> > >
> > > CAUTION: This email originated from outside of the organization.
> > > Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> > >
> > > Hi Emre,
> > >
> > > Thank you for driving this proposal. I've got two questions about the extensions to the proposal that are not captured in the FLIP.
> > >
> > > 1. Configurability - what kind of configuration would you propose to maintain for this feature? Would On/off switch and/or aggregated period length be configurable? Should we capture the toggles in the FLIP?
> > > 2. Metrics - are we planning to emit the skew metric via metric reporters mechanism. Should we capture proposed metric schema in the FLIP?
> > >
> > > Kind regards,
> > > Krzysztof
> > >
> > > ________________________________
> > > From: Kartoglu, Emre <kar...@amazon.co.uk.invaLID>
> > > Sent: Monday, January 15, 2024 4:59 PM
> > > To: dev@flink.apache.org <dev@flink.apache.org>
> > > Subject: [DISCUSS] FLIP-418: Show data skew score on Flink Dashboard
> > >
> > > Hello,
> > >
> > > I’m opening this thread to discuss a FLIP[1] to make data skew more visible on Flink Dashboard.
> > >
> > > Data skew is currently not as visible as it should be.
> > > Users have to click each operator and check how much data each sub-task is processing and compare the sub-tasks against each other. This is especially cumbersome and error-prone for jobs with big job graphs and high parallelism. I’m proposing this FLIP to improve this.
> > >
> > > Kind regards,
> > > Emre
> > >
> > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-418%3A+Show+data+skew+score+on+Flink+Dashboard