Lewis John McGibbney created NUTCH-2909:
-------------------------------------------
Summary: Standardize Nutch Metrics Counters
Key: NUTCH-2909
URL: https://issues.apache.org/jira/browse/NUTCH-2909
Project: Nutch
Issue Type: Improvement
Components: metrics
Affects Versions: 1.18
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Fix For: 1.19
I revisited Nutch metrics counters and put some [metrics
documentation|https://cwiki.apache.org/confluence/display/NUTCH/Metrics]
together for others to consult should they wish.
I thought a comprehensive collection of all Nutch Counters would be useful so I
put together a [metrics
table|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-MetricsTable].
One (unintended) outcome was that it highlighted the variability in
counter group names and metric names. For example:
*Metric Group*:
* _CleaningJobStatus_ - upper camel case
* _CrawlDB filter_ - inconsistent use of capitalization and space separated
* N/A - the DomainStatistics counters don't belong to a metric group
* _injector_ - lowercase named after the encapsulating Class
* _WebGraph.outlinks_ - inconsistent use of capitalization and period separated
The *Metric Names* are no better; their naming is just as inconsistent.
I am keen to bring some convention to the Nutch metrics definitions but this is
not all plain sailing. I do understand that existing users may rely upon the
above metrics as they are, and that changing the names would have impacts downstream.
*PROPOSAL*
I would like to discuss introducing a naming convention which follows some
simple principles motivated by a [Datadog employee's response on
SO|https://stackoverflow.com/a/18131221].
As a take on that post, I want to propose the following:
{quote}
1. With regard to the *Metric Group*, the highest level of hierarchy is the product
line or the process, i.e., _*nutch*_. The highest level of hierarchy is always
lowercase.
2. The next level of hierarchy is the sub-component/tool, i.e.,
*_nutch.Injector_*, *_nutch.Generator_*, *_nutch.ParseSegment_*,
*_nutch.SitemapProcessor_*, etc. This constituent exactly matches the name of the
enclosing Class, which makes it simple to trace a metric back to the Class in
which it was defined.
3. The third level of the hierarchy is the metric group, which is a general
grouping of functionality for the metric being defined, e.g.,
*_nutch.QueueFeeder.fetcher_status_*. This constituent is lowercase with words
separated by underscores. If no obvious metric group exists, simply provide the
name of the enclosing Class in lowercase, e.g., the _injector_ in
*_nutch.Injector.injector.urls_filtered_*.
4. With regard to the *Metric Name*, the last level of hierarchy is the thing
being measured, e.g., *_urls_filtered_*, *_above_exception_threshold_in_queue_*,
etc. Everything is lowercase with words separated by underscores, the same as #3
above.
Example complete metrics:
* *_nutch.Injector.injector.urls_filtered_*
* *_nutch.ResolverThread.update_host_db.checked_hosts_*
* *_nutch.WebGraph.outlinks.added_links_*
{quote}
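As a rough illustration of how the four rules above could be applied mechanically, here is a minimal sketch. The {{MetricNames}} helper, its method names, and the camel-case inputs in {{main}} are all hypothetical; nothing like this exists in Nutch today, and the exact normalization rules are open for discussion.

```java
import java.util.Locale;

/** Hypothetical helper (not an existing Nutch class) that composes names per the proposal. */
final class MetricNames {

  private MetricNames() {}

  /**
   * Normalizes a label to lowercase words separated by underscores.
   * Assumption: camel-case boundaries, spaces, and periods all count as word breaks.
   */
  static String snake(String label) {
    String split = label.trim().replaceAll("([a-z0-9])([A-Z])", "$1_$2");
    String lower = split.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", "_");
    return lower.replaceAll("^_+|_+$", ""); // drop leading/trailing underscores
  }

  /**
   * Builds nutch.<ClassName>.<metric_group>.<metric_name>.
   * The class name is kept exactly as given (rule 2); if no group is supplied,
   * the lowercased class name is used instead (rule 3).
   */
  static String of(String className, String group, String name) {
    String grp = (group == null || group.trim().isEmpty())
        ? className.toLowerCase(Locale.ROOT)
        : snake(group);
    return "nutch." + className + "." + grp + "." + snake(name);
  }

  public static void main(String[] args) {
    // Illustrative inputs only; they are not taken from the Nutch source.
    System.out.println(of("Injector", null, "urls filtered"));
    System.out.println(of("WebGraph", "outlinks", "added links"));
    System.out.println(of("QueueFeeder", "FetcherStatus", "AboveExceptionThresholdInQueue"));
  }
}
```

Centralizing name construction in one place like this would also give us a single point at which to keep the old names alive as aliases for a release or two, should we decide downstream users need a migration period.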
It would be greatly appreciated if folks could chime in on the details of the
proposal. I'm sure there are several areas which could be improved.
I will mention that my specific driver for cleaning this up is that I would
like to push Nutch metrics into Enterprise Splunk so that the Nutch crawler
subsystem will be integrated with all the rest of the subsystems I am
responsible for. We use Splunk for that kind of thing. I intend to do that by
implementing the [Java statsd
client|https://github.com/DataDog/java-dogstatsd-client] but I feel that comes
after we clean up metrics and establish a metrics naming convention.
Thanks for any input.