[jira] [Commented] (NUTCH-2909) Establish a metrics naming convention

Isabelle Giguere (Jira) Mon, 24 Nov 2025 07:48:07 -0800


    [ 
https://issues.apache.org/jira/browse/NUTCH-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18040366#comment-18040366
 ]


Isabelle Giguere commented on NUTCH-2909:
-----------------------------------------

A friend recently suggested that I try Cursor: https://cursor.com/

This friend doesn't code at all, so he uses it extensively.  Myself, I would 
not replace my own coding skills with AI!

However, I did test it.  Knowing that Nutch metrics are not optimal, I pointed 
the tool to the Nutch code base, and prompted it with this instruction:

"Find all code where hadoop metrics are calculated and suggest improvements"

It provided a full report, in Markdown format.  We don't need to observe the 
tool's suggested naming convention, and of course, all suggestions should be 
re-evaluated by humans, as they may not all be relevant.  But I'm attaching 
the report, in case it helps guide future improvements.

As I said, I have no intention of using AI to code in my place, but if AI can 
help with some tedious tasks, like going through the whole code base just to 
find areas for improvement, then why not.


> Establish a metrics naming convention
> -------------------------------------
>
>                 Key: NUTCH-2909
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2909
>             Project: Nutch
>          Issue Type: Improvement
>          Components: metrics
>    Affects Versions: 1.18
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Major
>             Fix For: 1.22
>
>         Attachments: HADOOP_METRICS_ANALYSIS.md
>
>
> I revisited Nutch metrics counters and put some [metrics 
> documentation|https://cwiki.apache.org/confluence/display/NUTCH/Metrics] 
> together for others to consult should they wish.
> I thought a comprehensive collection of all Nutch Counters would be useful so 
> I put together a [metrics 
> table|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-MetricsTable].
>  One (unintended) outcome was that this highlighted the variability 
> in counter group names and metric names. For example:
> *Metric Group*:
> * _CleaningJobStatus_ - upper camel case
> * _CrawlDB filter_ - inconsistent use of capitalization and space separated
> * N/A - the DomainStatistics counters don't belong to a metric group
> * _injector_ - lowercase named after the encapsulating Class
> * _WebGraph.outlinks_ - inconsistent use of capitalization and period 
> separated
> The *Metric Name* values are in much the same state... pretty much all over the place.
> I am keen to bring some convention to the Nutch metrics definitions but this 
> is not all plain sailing. I do understand that existing users may rely upon 
> the above metrics as they are, and changing the values would have impacts 
> downstream.
> *PROPOSAL*
> I would like to discuss introducing a naming convention which follows some 
> simple principles motivated by a [Datadog employee's response on 
> SO|https://stackoverflow.com/a/18131221].
> As a take on that post, I want to propose the following
> {quote}
> 1. With regard to *Metric Group*, the highest level of hierarchy is the 
> product line or the process, i.e., _*nutch*_. The highest level of hierarchy 
> is always lowercase.
> 2. The next level of hierarchy is the sub-component/tool, i.e., 
> *_nutch.Injector_*, *_nutch.Generator_*, *_nutch.ParseSegment_*, 
> *_nutch.SitemapProcessor_*, etc. This constituent is exactly the name of the 
> enclosing Class. This way it is really simple to trace the metric back to the 
> Class in which it was defined.
> 3. The third level of the hierarchy is the metric group which is a general 
> grouping of functionality for the metric being defined i.e. 
> *_nutch.QueueFeeder.fetcher_status_*. This constituent is lowercase with 
> words separated by underscore. If no obvious metric group exists simply 
> provide the enclosing Class in lowercase i.e.,  
> *_nutch.Injector.injector.urls_filtered_*
> 4. With regard to the *Metric Name*, the last level of hierarchy is the 
> thing being measured, i.e., *_urls_filtered_*, 
> *_above_exception_threshold_in_queue_*, etc. Everything is lowercase with 
> words separated by underscores, the same as #3 above.
> Example complete metrics
> * *_nutch.Injector.injector.urls_filtered_*
> * *_nutch.ResolverThread.update_host_db.checked_hosts_*
> * *_nutch.WebGraph.outlinks.added_links_*
> {quote}
> It would be greatly appreciated if folks could chime in on the details of the 
> proposal. I'm sure there are several areas which could be improved. 
> I will mention that my specific driver for cleaning this up is that I would 
> like to push Nutch metrics into Enterprise Splunk so that the Nutch crawler 
> subsystem will be integrated with all the rest of the subsystems I am 
> responsible for. We use Splunk for that kind of thing. I intend to do that by 
> implementing the [Java statsd 
> client|https://github.com/DataDog/java-dogstatsd-client] but I feel that 
> comes after we clean up metrics and establish a metrics naming convention.
> Thanks for any input. 
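
The four-level convention proposed in the quoted description could be sketched as a small helper along the following lines. This is only an illustration: the class MetricNames, its methods, and the decision to normalise free-form input (spaces, periods, camel case) to lowercase with underscores are assumptions made here, not existing Nutch code and not part of the attached report. The fallback group follows the literal wording "the enclosing Class in lowercase".

```java
import java.util.Locale;

/** Hypothetical helper (not part of Nutch) that assembles names per the proposal. */
final class MetricNames {

  /** Level 1: the product line, always lowercase. */
  private static final String ROOT = "nutch";

  private MetricNames() {}

  /** Normalises a free-form label such as "CrawlDB filter" to lowercase_with_underscores. */
  static String toSnakeCase(String raw) {
    String s = raw.trim()
        // insert an underscore at lower-to-upper camel case boundaries
        .replaceAll("([a-z0-9])([A-Z])", "$1_$2")
        // collapse spaces, periods and other separators into a single underscore
        .replaceAll("[^A-Za-z0-9]+", "_")
        // drop leading and trailing underscores
        .replaceAll("^_+|_+$", "");
    return s.toLowerCase(Locale.ROOT);
  }

  /**
   * Builds nutch.<Component>.<group>.<metric_name>.
   *
   * @param component simple name of the enclosing class, kept exactly as written (level 2)
   * @param group     metric group (level 3); if null or blank, the class name in lowercase is used
   * @param name      the thing being measured (level 4)
   */
  static String build(String component, String group, String name) {
    String g = (group == null || group.trim().isEmpty())
        ? component.toLowerCase(Locale.ROOT)
        : toSnakeCase(group);
    return ROOT + "." + component + "." + g + "." + toSnakeCase(name);
  }
}
```

With this sketch, MetricNames.build("Injector", null, "urls filtered") yields nutch.Injector.injector.urls_filtered, and MetricNames.build("WebGraph", "outlinks", "added links") yields nutch.WebGraph.outlinks.added_links. Existing names would still need a migration path for users who depend on them, which the helper does not address.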



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
