[jira] [Commented] (NUTCH-3141) Cache Hadoop Counter References in Hot Paths

Hudson (Jira) Thu, 08 Jan 2026 10:33:57 -0800


    [ 
https://issues.apache.org/jira/browse/NUTCH-3141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18050675#comment-18050675
 ]


Hudson commented on NUTCH-3141:
-------------------------------

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #211 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/211/])
NUTCH-3141 Cache Hadoop Counter References in Hot Paths (#878) (github: 
[https://github.com/apache/nutch/commit/66f678e62f57de30e605a1e0d23d7923bf21c780])
* (edit) src/java/org/apache/nutch/hostdb/UpdateHostDbReducer.java
* (edit) src/java/org/apache/nutch/fetcher/QueueFeeder.java
* (edit) src/java/org/apache/nutch/crawl/DeduplicationJob.java
* (edit) src/java/org/apache/nutch/indexer/IndexerMapReduce.java
* (edit) src/java/org/apache/nutch/hostdb/UpdateHostDbMapper.java
* (edit) src/java/org/apache/nutch/crawl/CrawlDb.java
* (edit) src/java/org/apache/nutch/tools/warc/WARCExporter.java
* (edit) src/java/org/apache/nutch/util/SitemapProcessor.java


> Cache Hadoop Counter References in Hot Paths
> --------------------------------------------
>
>                 Key: NUTCH-3141
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3141
>             Project: Nutch
>          Issue Type: Sub-task
>          Components: metrics
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Major
>             Fix For: 1.22
>
>
> Hadoop's _*context.getCounter(group, name)*_ performs a lookup in an internal 
> data structure each time it is called. In hot paths (code executed thousands 
> or millions of times during a crawl), these repeated lookups add 
> measurable overhead. Cached counter references were implemented earlier in 
> FetcherThread. This task extends that approach to the following classes:
> ||Class||Counter Calls in Hot Path||Impact||
> |QueueFeeder.java|4 counters in main loop|High - processes every URL|
> |IndexerMapReduce.java|10+ counters in reduce()|High - every indexed doc|
> |SitemapProcessor.java|8 counters in mapper|Medium|
> |WARCExporter.java|10 counters in reduce|Medium|
> |UpdateHostDbMapper.java|3 counters in map()|Medium|
> |UpdateHostDbReducer.java|3 counters in reduce()|Medium|
> Potential savings for a crawl processing 1 million URLs:
>  * QueueFeeder: Up to 4 million avoided lookups
>  * IndexerMapReduce: Up to 10 million avoided lookups (if every code path is hit)
>  * Total potential: 10-15% reduction in overhead for counter-heavy operations
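
The caching pattern described above can be illustrated with a minimal, self-contained sketch. It does not reproduce the actual Nutch change: the Context and Counter classes below are hypothetical stand-ins for Hadoop's task context and org.apache.hadoop.mapreduce.Counter, and the group and counter names are made up for illustration. The stand-in context simply counts how often getCounter is called, so the difference between the two styles is visible.

```java
import java.util.HashMap;
import java.util.Map;

public class CounterCacheSketch {

    /** Hypothetical stand-in for org.apache.hadoop.mapreduce.Counter. */
    static final class Counter {
        private long value;
        void increment(long delta) { value += delta; }
        long getValue() { return value; }
    }

    /** Hypothetical stand-in for a Hadoop task context; it records every lookup. */
    static final class Context {
        private final Map<String, Counter> counters = new HashMap<>();
        long lookups = 0;

        Counter getCounter(String group, String name) {
            lookups++; // each call pays for a map lookup
            return counters.computeIfAbsent(group + '\u0000' + name, k -> new Counter());
        }
    }

    /** Uncached style: the counter is looked up again for every record. */
    static void processUncached(Context context, int records) {
        for (int i = 0; i < records; i++) {
            context.getCounter("ExampleGroup", "records_processed").increment(1);
        }
    }

    /** Cached style: resolve the counter once (e.g. in setup()), then reuse the reference. */
    static void processCached(Context context, int records) {
        Counter processed = context.getCounter("ExampleGroup", "records_processed");
        for (int i = 0; i < records; i++) {
            processed.increment(1);
        }
    }

    public static void main(String[] args) {
        Context uncached = new Context();
        processUncached(uncached, 1_000_000);

        Context cached = new Context();
        processCached(cached, 1_000_000);

        System.out.println("uncached lookups: " + uncached.lookups);
        System.out.println("cached lookups:   " + cached.lookups);
    }
}
```

Both variants leave the counter at the same final value; only the number of lookups differs (one per record versus one in total). In a real mapper or reducer the cached reference would typically be a field initialised in setup(context), assuming the counter's group and name are known up front.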



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
