Lewis John McGibbney created NUTCH-2909:
-------------------------------------------

             Summary: Standardize Nutch Metrics Counters
                 Key: NUTCH-2909
                 URL: https://issues.apache.org/jira/browse/NUTCH-2909
             Project: Nutch
          Issue Type: Improvement
          Components: metrics
    Affects Versions: 1.18
            Reporter: Lewis John McGibbney
            Assignee: Lewis John McGibbney
             Fix For: 1.19


I revisited Nutch metrics counters and put some [metrics 
documentation|https://cwiki.apache.org/confluence/display/NUTCH/Metrics] 
together for others to consult should they wish.

I thought a comprehensive collection of all Nutch Counters would be useful so I 
put together a [metrics 
table|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-MetricsTable].
 One (unintended) outcome of this was that it highlighted the variability in 
counter group names and metric names. For example:

*Metric Group*:
* _CleaningJobStatus_ - upper camel case
* _CrawlDB filter_ - inconsistent use of capitalization and space separated
* N/A - the DomainStatistics counters don't belong to a metric group
* _injector_ - lowercase named after the encapsulating Class
* _WebGraph.outlinks_ - inconsistent use of capitalization and period separated

The *Metric Names* are much the same: pretty much all over the place.

I am keen to bring some convention to the Nutch metrics definitions but this is 
not all plain sailing. I do understand that existing users may rely upon the 
above metrics as they are, and changing the values would have impacts downstream.

*PROPOSAL*
I would like to discuss introducing a naming convention which follows some 
simple principles motivated by a [Datadog employee's response on 
SO|https://stackoverflow.com/a/18131221].

As a take on that post, I want to propose the following

{quote}
1. With regard to *Metric Group*, the highest level of hierarchy is the product 
line or the process i.e., _*nutch*_. The highest level of hierarchy is always 
lowercase.
2. The next level of hierarchy is the sub-component/tool, i.e., 
*_nutch.Injector_*, *_nutch.Generator_*, *_nutch.ParseSegment_*, 
*_nutch.SitemapProcessor_*, etc. This constituent is exactly the name of the 
enclosing Class. This way it is really simple to trace the metric back to the 
Class in which it was defined.
3. The third level of the hierarchy is the metric group which is a general 
grouping of functionality for the metric being defined i.e. 
*_nutch.QueueFeeder.fetcher_status_*. This constituent is lowercase with words 
separated by underscores. If no obvious metric group exists, simply provide the 
enclosing Class in lowercase, e.g., *_nutch.Injector.injector.urls_filtered_*.
4. With regard to the *Metric Name*, the last level of hierarchy is the thing 
being measured, e.g., *_urls_filtered_*, *_above_exception_threshold_in_queue_*, 
etc. Everything is lowercase with words separated by underscores, the same as #3 
above.

Example complete metrics

* *_nutch.Injector.injector.urls_filtered_*
* *_nutch.ResolverThread.update_host_db.checked_hosts_*
* *_nutch.WebGraph.outlinks.added_links_*
{quote}
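
To make the four rules concrete, here is a minimal Java sketch of how names following the proposed convention could be assembled. The class {{MetricNames}} and its methods are hypothetical and are not existing Nutch code. The camel-case splitting in {{snake}} is also an assumption about how legacy labels such as "CrawlDB filter" might be normalized; the proposal itself only asks for lowercase words separated by underscores.

```java
import java.util.Locale;

/** Hypothetical helper illustrating the proposed naming convention. */
final class MetricNames {

    /** Rule 1: the top level is always the lowercase product name. */
    private static final String ROOT = "nutch";

    private MetricNames() {}

    /** Rules 3 and 4: lowercase, with words separated by underscores. */
    static String snake(String label) {
        // Assumption: treat a lower-to-upper transition as a word boundary.
        String s = label.trim().replaceAll("([a-z0-9])([A-Z])", "$1_$2");
        // Collapse spaces, periods and other separators into one underscore.
        s = s.replaceAll("[^A-Za-z0-9]+", "_");
        // Drop any leading or trailing underscore left by the step above.
        s = s.replaceAll("^_+|_+$", "");
        return s.toLowerCase(Locale.ROOT);
    }

    /**
     * Builds nutch.<ClassName>.<group>.<metric>.
     * Rule 2: the class name is kept exactly as written.
     * Rule 3 fallback: with no group, use the class name in lowercase.
     */
    static String of(String className, String group, String metric) {
        String g = (group == null || group.trim().isEmpty())
                ? className.toLowerCase(Locale.ROOT)
                : snake(group);
        return ROOT + "." + className + "." + g + "." + snake(metric);
    }

    public static void main(String[] args) {
        System.out.println(of("Injector", null, "urls filtered"));
        System.out.println(of("ResolverThread", "UpdateHostDb", "checked_hosts"));
        System.out.println(of("WebGraph", "outlinks", "added links"));
    }
}
```

Keeping the class name verbatim while normalizing only the group and metric parts is what keeps a metric traceable back to its defining Class.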

It would be greatly appreciated if folks could chime in on the details of the 
proposal. I'm sure there are several areas which could be improved. 

I will mention that my specific driver for cleaning this up is that I would 
like to push Nutch metrics into Enterprise Splunk so that the Nutch crawler 
subsystem will be integrated with all the rest of the subsystems I am 
responsible for. We use Splunk for that kind of thing. I intend to do that by 
implementing the [Java statsd 
client|https://github.com/DataDog/java-dogstatsd-client] but I feel that comes 
after we clean up metrics and establish a metrics naming convention.

Thanks for any input. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
