Regarding transfer: I think objects are fine, as long as they are not user-defined objects. We can limit it to String and subclasses of Number.
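A minimal sketch of that restriction, to make the idea concrete. The class and method names (TransferableValues, isTransferable, toTransferable, dump) are hypothetical and not part of Flink, and sending anything else as its string form is an assumption of this sketch, not something decided in this thread:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not Flink API: limits dumped values to String and Number. */
class TransferableValues {

    /** True if the value is a String or a subclass of Number. */
    static boolean isTransferable(Object value) {
        return value instanceof String || value instanceof Number;
    }

    /**
     * Keeps Strings and Numbers unchanged; any other value is replaced by its
     * string form (assumption of this sketch) so no user-defined class is sent.
     */
    static Object toTransferable(Object value) {
        return isTransferable(value) ? value : String.valueOf(value);
    }

    /** Builds the metric_name:value pairs that would be handed to the web monitor. */
    static Map<String, Object> dump(Map<String, Object> metrics) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : metrics.entrySet()) {
            out.put(entry.getKey(), toTransferable(entry.getValue()));
        }
        return out;
    }
}
```

Note that a plain instanceof check also accepts a user-defined subclass of Number, so a stricter variant might convert such values to a JDK type before sending.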
Regarding traversal of groups: I am still thinking here in terms of the paradigm that the metrics should impact the regular system as little as possible. Shifting work to the "query/dump" action is good in that sense, unless that means permanent re-construction of the name.
The metric query endpoint could (should) be a separate actor from the TaskManager, in my opinion. That also solves the issue of blocking the TaskManager actor.
BTW: Can the Dumper be simply a special reporter that understands the component metric groups and does not use scope formats?

On Tue, Aug 2, 2016 at 3:50 PM, Chesnay Schepler <ches...@apache.org> wrote:

> Thank you for your feedback :)
>
> Regarding names:
>
> The Dumper does not create a MetricSnapshot. The Dumper creates a
> list of key-value pairs; metric_name:value.
> A (single) MetricSnapshot exists in the WebRuntimeMonitor, into
> which the dumped list is inserted.
>
> So the dumper creates a snapshot but not a MetricSnapshot, and the
> WebRuntimeMonitor contains a MetricSnapshot which isn't really a
> snapshot but more a storage.
>
> The naming isn't the best.
>
> I'm not sure if "Service" really fits the bill; I associate a
> service with a separate thread running in the background.
>
> Regarding merging of metrics:
>
> We are not merging any metrics right now. While Counters are easy to
> merge, for Gauges we may have to let the user choose in the
> WebInterface how they should be aggregated.
>
> This is /not really/ a problem; in the sense that we don't have
> different versions overwriting each other:
>
> * JM/TM metrics don't have to be merged
> * task metrics can be kept on a per subtask/operator level for now
>   (the prototype exposes them as
>   "<subtask_index>_<operator_name>_<metric_name>")
> * job metrics are currently only gathered on the JM; so no merging
>   here either
>
> Regarding transfer:
>
> Should we transfer numbers as numbers, or also as strings? I'm
> concerned about the efficiency of the whole thing; if we send some
> metrics as strings and some as numbers we have to decide for every
> metric which option we should take. That's why I was wondering
> whether to send everything as objects or everything as strings.
>
> Regarding traversal of groups:
>
> Yes, we would save on startup/teardown time if we traversed the
> groups instead. However, the dumping itself would become more
> expensive this way; and since this is done by the TaskManager thread
> I wanted to keep it as simple as possible.
>
> Also, there is currently no way to access the metrics contained in a
> group. We would have to add another method to the
> AbstractMetricGroup, which I would prefer not to do as it can lead
> to concurrency issues during teardown.
>
> On 02.08.2016 15:05, Till Rohrmann wrote:
>
>> The metrics transfer design document looks good to me. Thanks for
>> your work Chesnay :-)
>>
>> I think the benefit of registering the metrics at the MetricDumper
>> is that we don't have to walk through the hierarchy of metric groups
>> to collect the metric values. Indeed, this comes with increased
>> costs at start-up. But I'm not sure what's the concrete impact on
>> job performance in these cases.
>>
>> Cheers,
>> Till
>>
>> On Tue, Aug 2, 2016 at 8:34 PM, Stephan Ewen <se...@apache.org> wrote:
>>
>>> Hi!
>>>
>>> Thanks for writing this up. I think it looks quite reasonable (I
>>> hope I understood that design correctly)
>>>
>>> There is one point of confusion left for me, though: The
>>> MetricDumper and MetricSnapshot: I think it is just the names that
>>> confuse me here.
>>> It looks like they define a way to query the metrics in the Metric
>>> Registry in a standard schema (independent of the scope formats).
>>> Should the "dumper" maybe be called "MetricsQueryService" or so
>>> (the query service returns a MetricSnapshot, if I understand
>>> correctly).
>>>
>>> It would be great if the "query service" would not need metrics to
>>> be registered - saves us some effort during startup / teardown. It
>>> looks as if the query service could just use the root-most
>>> component metric groups to walk the tree of whatever metric is
>>> currently there and put it into the current snapshot.
>>>
>>> One open question that I have is: How do you know how to merge the
>>> metrics from the subtasks, for example in case you want a metric
>>> across subtasks.
>>>
>>> In general, not transferring objects (only strings / numbers) would
>>> be preferable, because the WebMonitor may run in an environment
>>> where no user-code classloader can be used.
>>> It may run in the dispatcher (which must be trusted and cannot
>>> execute user code).
>>>
>>> Greetings,
>>> Stephan
>>>
>>> On Thu, Jul 28, 2016 at 3:12 PM, Chesnay Schepler <ches...@apache.org> wrote:
>>>
>>>> Hello,
>>>>
>>>> I just created a new FLIP which aims at exposing our metrics to the
>>>> WebInterface.
>>>>
>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-7%3A+Expose+metrics+to+WebInterface
>>>>
>>>> Looking forward to feedback :)
>>>>
>>>> Regards,
>>>> Chesnay Schepler