[
https://issues.apache.org/jira/browse/NUTCH-3141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18047723#comment-18047723
]
ASF GitHub Bot commented on NUTCH-3141:
---------------------------------------
lewismc opened a new pull request, #878:
URL: https://github.com/apache/nutch/pull/878
This is a fairly straightforward patch for
[NUTCH-3141](https://issues.apache.org/jira/browse/NUTCH-3141). The cache
pattern is already implemented in FetcherThread so this PR introduces more
consistency across the codebase.
> Cache Hadoop Counter References in Hot Paths
> --------------------------------------------
>
> Key: NUTCH-3141
> URL: https://issues.apache.org/jira/browse/NUTCH-3141
> Project: Nutch
> Issue Type: Sub-task
> Components: metrics
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.22
>
>
> Hadoop's _*context.getCounter(group, name)*_ performs a lookup in an internal
> data structure each time it's called. In hot paths (code executed thousands
> or millions of times during a crawl), these repeated lookups create
> measurable overhead. We implemented cached counter references in
> FetcherThread earlier. This task extends that approach to other classes, namely:
> ||Class||Counter Calls in Hot Path||Impact||
> |QueueFeeder.java|4 counters in main loop|High - processes every URL|
> |IndexerMapReduce.java|10+ counters in reduce()|High - every indexed doc|
> |SitemapProcessor.java|8 counters in mapper|Medium|
> |WARCExporter.java|10 counters in reduce|Medium|
> |UpdateHostDbMapper.java|3 counters in map()|Medium|
> |UpdateHostDbReducer.java|3 counters in reduce()|Medium|
> Potential performance impacts/optimizations for a crawl processing 1 million
> URLs:
> * QueueFeeder: Up to 4 million avoided lookups
> * IndexerMapReduce: Up to 10 million avoided lookups (if all paths hit)
> * Total potential: 10-15% reduction in overhead for counter-heavy operations
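
For readers unfamiliar with the pattern, here is a minimal sketch of what "caching counter references" means and where the "up to 4 million avoided lookups" figure comes from. It is not the code from the PR. To stay runnable without Hadoop on the classpath, `Context` and `Counter` below are small stand-ins that only mimic the shape of `context.getCounter(group, name)` and `Counter.increment(long)`; the group name, the four counter names, and the `lookups` field are hypothetical and exist only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class CachedCounterSketch {

    // Hypothetical group and counter names, for illustration only.
    static final String GROUP = "ExampleGroup";
    static final String[] NAMES = {"counter_a", "counter_b", "counter_c", "counter_d"};

    /** Stand-in for a Hadoop counter; only increment/getValue are modeled. */
    static final class Counter {
        private long value;
        void increment(long delta) { value += delta; }
        long getValue() { return value; }
    }

    /** Stand-in for a task context: every getCounter call does a map lookup. */
    static final class Context {
        private final Map<String, Map<String, Counter>> groups = new HashMap<>();
        long lookups = 0; // instrumentation for this sketch, not a Hadoop feature

        Counter getCounter(String group, String name) {
            lookups++;
            return groups.computeIfAbsent(group, g -> new HashMap<>())
                         .computeIfAbsent(name, n -> new Counter());
        }
    }

    /** Uncached: each record resolves each counter by name again. */
    static void processUncached(Context context, int records) {
        for (int i = 0; i < records; i++) {
            for (String name : NAMES) {
                context.getCounter(GROUP, name).increment(1);
            }
        }
    }

    /** Cached: resolve each counter once up front, then reuse the references. */
    static void processCached(Context context, int records) {
        Counter[] cached = new Counter[NAMES.length];
        for (int j = 0; j < NAMES.length; j++) {
            cached[j] = context.getCounter(GROUP, NAMES[j]);
        }
        for (int i = 0; i < records; i++) {
            for (Counter counter : cached) {
                counter.increment(1);
            }
        }
    }

    public static void main(String[] args) {
        Context uncached = new Context();
        processUncached(uncached, 1_000_000);

        Context cached = new Context();
        processCached(cached, 1_000_000);

        System.out.println("uncached lookups: " + uncached.lookups); // 4000000
        System.out.println("cached lookups:   " + cached.lookups);   // 4
    }
}
```

With 4 counters and 1 million records, the uncached loop performs 4 million lookups while the cached loop performs 4, and both end with identical counter values. In a real Mapper or Reducer the one-time resolution would presumably happen in `setup()`, with the references held in instance fields; that placement is an assumption here, not something taken from the patch.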