Lewis John McGibbney created NUTCH-3141:
-------------------------------------------

             Summary: Cache Hadoop Counter References in Hot Paths
                 Key: NUTCH-3141
                 URL: https://issues.apache.org/jira/browse/NUTCH-3141
             Project: Nutch
          Issue Type: Sub-task
          Components: metrics
            Reporter: Lewis John McGibbney
            Assignee: Lewis John McGibbney
             Fix For: 1.22


Hadoop's _*context.getCounter(group, name)*_ performs a lookup in an internal 
data structure each time it is called. In hot paths (code executed thousands or 
millions of times during a crawl), these repeated lookups create measurable 
overhead. Cached counter references were implemented in FetcherThread earlier. 
This task extends the same approach to the following classes:
||Class||Counter Calls in Hot Path||Impact||
|QueueFeeder.java|4 counters in main loop|High - processes every URL|
|IndexerMapReduce.java|10+ counters in reduce()|High - every indexed doc|
|SitemapProcessor.java|8 counters in mapper|Medium|
|WARCExporter.java|10 counters in reduce|Medium|
|UpdateHostDbMapper.java|3 counters in map()|Medium|
|UpdateHostDbReducer.java|3 counters in reduce()|Medium|
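
The caching pattern described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the Nutch or Hadoop code: SketchContext and SketchCounter are hypothetical stand-ins for the Hadoop task context and counter, and the group and counter names are made up. The stand-in context records how many lookups it serves so the two styles can be compared.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a Hadoop counter: a mutable long. */
final class SketchCounter {
    private long value;
    void increment(long delta) { value += delta; }
    long getValue() { return value; }
}

/** Hypothetical stand-in for the task context; it counts every lookup it serves. */
final class SketchContext {
    private final Map<String, SketchCounter> counters = new HashMap<>();
    long lookups = 0;

    SketchCounter getCounter(String group, String name) {
        lookups++; // each call pays for a map lookup
        return counters.computeIfAbsent(group + "::" + name, k -> new SketchCounter());
    }
}

final class CounterCachingSketch {
    // Assumed names, for illustration only.
    static final String GROUP = "SketchGroup";
    static final String NAME = "records_processed";

    /** Uncached style: one lookup per record inside the hot loop. */
    static long runUncached(SketchContext ctx, int records) {
        for (int i = 0; i < records; i++) {
            ctx.getCounter(GROUP, NAME).increment(1);
        }
        return ctx.lookups;
    }

    /** Cached style: resolve the counter once, then reuse the reference. */
    static long runCached(SketchContext ctx, int records) {
        SketchCounter processed = ctx.getCounter(GROUP, NAME); // single lookup, e.g. in setup()
        for (int i = 0; i < records; i++) {
            processed.increment(1);
        }
        return ctx.lookups;
    }

    public static void main(String[] args) {
        System.out.println("uncached lookups: " + runUncached(new SketchContext(), 1_000_000));
        System.out.println("cached lookups:   " + runCached(new SketchContext(), 1_000_000));
    }
}
```

In a real mapper or reducer the cached references would presumably be resolved once per task (for example in setup()) and held in fields, so both styles report identical counter values and differ only in the number of lookups.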

Potential performance impacts/optimizations for a crawl processing 1 million 
URLs:
 * QueueFeeder: Up to 4 million avoided lookups
 * IndexerMapReduce: Up to 10 million avoided lookups (if all paths hit)
 * Total potential: 10-15% reduction in overhead for counter-heavy operations
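
The lookup figures above follow from simple arithmetic, sketched here under simplifying assumptions: a single task, every counter touched once per record, and one lookup per counter remaining after caching. The class and method names are hypothetical and only illustrate the estimate.

```java
final class AvoidedLookupEstimate {
    /**
     * Upper bound on lookups avoided when each counter is resolved once
     * instead of once per record (assumes a single task).
     */
    static long avoidedLookups(long records, int countersPerRecord) {
        long uncached = Math.multiplyExact(records, (long) countersPerRecord);
        long cached = countersPerRecord; // one lookup per counter, done up front
        return Math.max(0L, uncached - cached);
    }

    public static void main(String[] args) {
        // 1 million URLs with 4 counters (QueueFeeder) and 10 counters (IndexerMapReduce).
        System.out.println(avoidedLookups(1_000_000L, 4));
        System.out.println(avoidedLookups(1_000_000L, 10));
    }
}
```

With several tasks, each task would pay its own handful of setup lookups, so the real saving is slightly below these bounds.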


