[ https://issues.apache.org/jira/browse/NUTCH-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881791#comment-17881791 ]
Sebastian Nagel commented on NUTCH-3059:
----------------------------------------

Ok, found the reason: it's because of [MultipleOutputs|https://hadoop.apache.org/docs/current/api/index.html?org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html] used by Generator to write one output per segment and fetch list partition.

We could set
{code}
MultipleOutputs.setCountersEnabled(job, true);
{code}
which would add one counter for each segment:
{noformat}
$> nutch generate crawldb segments -topN 1000 -numFetchers 3 -maxNumSegments 2
org.apache.hadoop.mapreduce.lib.output.MultipleOutputs
        fetchlist-1/part=1000
        fetchlist-2/part=399
{noformat}
and a segments/ directory tree (after the partition job):
{noformat}
segments/
|---20240914162841/
|   `-crawl_generate/
|       |-part-r-00000
|       |-part-r-00001
|       `-part-r-00002
`---20240914162906/
    `-crawl_generate/
        |-part-r-00000
        |-part-r-00001
        `-part-r-00002
{noformat}

Any thoughts or objections? Otherwise, I would open a PR...

> Generator: selector job does not count reduce output records
> ------------------------------------------------------------
>
>                 Key: NUTCH-3059
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3059
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 1.20
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.21
>
>
> The selector step (job) of the Generator does not count the reduce output
> records resp. shows the count "0":
> {noformat}
> 2024-06-05 13:57:09,299 INFO o.a.n.c.Generator [main] Generator: starting
> 2024-06-05 13:57:09,299 INFO o.a.n.c.Generator [main] Generator: selecting
> best-scoring urls due for fetch.
> ...
>         Map-Reduce Framework
>                 Map input records=6
>                 Map output records=6
> ...
>                 Combine input records=0
>                 Combine output records=0
>                 Reduce input groups=1
>                 Reduce shuffle bytes=594
>                 Reduce input records=6
>                 Reduce output records=0
>                 Spilled Records=12
> ...
> {noformat}
> Not a big issue but should investigate why this happens. The other counters
> seem to work properly, also the partitioner job shows the reduce output
> records. The issue is observed in local and distributed mode.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
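
Editorial illustration of the comment's finding: the sketch below is a toy model in plain Java, not Hadoop or Nutch code. It assumes that records written through a named output bypass the default write path that increments "Reduce output records", and that enabling counters adds one counter per base output path in a group named after the MultipleOutputs class, as the counter listing in the comment suggests. The class NamedOutputCounterModel and its methods are hypothetical names introduced only for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Toy model (no Hadoop dependency) of the counter behaviour discussed in
 * NUTCH-3059. All names here are hypothetical and only illustrate the idea.
 */
public class NamedOutputCounterModel {

    static final String FRAMEWORK_GROUP = "Map-Reduce Framework";
    static final String REDUCE_OUTPUT = "Reduce output records";
    // Group name as it appears in the counter listing quoted in the comment.
    static final String NAMED_GROUP =
        "org.apache.hadoop.mapreduce.lib.output.MultipleOutputs";

    private final boolean countersEnabled;
    private final Map<String, Long> counters = new LinkedHashMap<>();

    public NamedOutputCounterModel(boolean countersEnabled) {
        this.countersEnabled = countersEnabled;
    }

    private void increment(String group, String name) {
        counters.merge(group + "\t" + name, 1L, Long::sum);
    }

    /** Returns the current value of a counter, or 0 if it was never touched. */
    public long get(String group, String name) {
        return counters.getOrDefault(group + "\t" + name, 0L);
    }

    /** Stand-in for the default write path: the framework counter is updated. */
    public void defaultWrite(String record) {
        increment(FRAMEWORK_GROUP, REDUCE_OUTPUT);
    }

    /**
     * Stand-in for a write to a named output: the framework counter is not
     * touched (assumption); a per-output counter is updated only if enabled.
     */
    public void namedWrite(String baseOutputPath, String record) {
        if (countersEnabled) {
            increment(NAMED_GROUP, baseOutputPath);
        }
    }

    public static void main(String[] args) {
        NamedOutputCounterModel model = new NamedOutputCounterModel(true);
        model.namedWrite("fetchlist-1/part", "http://a.example/");
        model.namedWrite("fetchlist-1/part", "http://b.example/");
        model.namedWrite("fetchlist-2/part", "http://c.example/");
        System.out.println(REDUCE_OUTPUT + "="
            + model.get(FRAMEWORK_GROUP, REDUCE_OUTPUT));
        System.out.println("fetchlist-1/part="
            + model.get(NAMED_GROUP, "fetchlist-1/part"));
        System.out.println("fetchlist-2/part="
            + model.get(NAMED_GROUP, "fetchlist-2/part"));
    }
}
```

In this model the framework counter stays at zero however many records go to named outputs, which mirrors the "Reduce output records=0" line in the issue description; the per-output counters appear only when the flag passed to the constructor is true.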