becketqin commented on pull request #13920: URL: https://github.com/apache/flink/pull/13920#issuecomment-723556790
@StephanEwen Thanks for the comment. I have updated the patch. Would you help take a look? I have also left some thoughts regarding the performance impact. Please let me know if you still have concerns.

Just want to add to the performance discussion. The first version of the implementation had a `DynamicMetricSampler` class, which does the following:

1. Take a target metric reporting overhead as a percentage, for example 0.01%.
2. Measure the absolute time it takes to report a metric, e.g. 1000 ns.
3. Calculate the metric sampling interval from the overhead and the throughput. In the above case, if the throughput is 1000 records per second, each record takes 1 ms (1,000,000 ns) to process. With an overhead of 0.01%, the budget for metric reporting is 100 ns per record. Given that each metric report takes 1000 ns, metrics should be reported once every 10 records.
4. Adjust the metric sampling interval periodically to reflect the latest throughput.

The above logic allows a quantifiable, bounded performance impact on the throughput. But I removed it because in most cases periodic reporting is good enough, e.g. reporting metrics every second, so we can avoid some complexity. If you think this approach helps, we can also bring it back.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
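For reference, the sampling-interval arithmetic described in the comment (overhead budget per record divided into the cost of one report) can be sketched roughly as follows. This is only an illustration under stated assumptions: `SamplingIntervalSketch` and `computeInterval` are hypothetical names, not the `DynamicMetricSampler` from the patch, and the rounding and clamping choices are guesses.

```java
/**
 * Hypothetical sketch of the interval calculation described above.
 * It is NOT the DynamicMetricSampler from the PR; names and rounding are assumptions.
 */
final class SamplingIntervalSketch {

    private SamplingIntervalSketch() {}

    /**
     * @param targetOverheadFraction allowed overhead as a fraction, e.g. 0.0001 for 0.01%
     * @param reportCostNanos measured cost of reporting one metric, in nanoseconds
     * @param recordsPerSecond most recently observed throughput
     * @return N, meaning "report metrics once every N records" (at least 1)
     */
    static long computeInterval(
            double targetOverheadFraction, long reportCostNanos, double recordsPerSecond) {
        if (targetOverheadFraction <= 0 || reportCostNanos <= 0 || recordsPerSecond <= 0) {
            throw new IllegalArgumentException("all arguments must be positive");
        }
        // Average processing time per record, derived from the throughput.
        double nanosPerRecord = 1_000_000_000.0 / recordsPerSecond;
        // Time budget for metric reporting that each record contributes.
        double budgetNanosPerRecord = nanosPerRecord * targetOverheadFraction;
        // Number of records needed to accumulate enough budget for one report.
        // Rounding up keeps the overhead at or below the target (an assumption).
        long interval = (long) Math.ceil(reportCostNanos / budgetNanosPerRecord);
        return Math.max(1L, interval);
    }
}
```

With the numbers from the comment, `SamplingIntervalSketch.computeInterval(0.0001, 1000, 1000.0)` yields an interval of 10 records. Step 4 would then amount to calling this again periodically with the latest measured throughput.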
