[ https://issues.apache.org/jira/browse/NUTCH-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881791#comment-17881791 ]
Sebastian Nagel commented on NUTCH-3059:
----------------------------------------
OK, found the reason: the Generator uses
[MultipleOutputs|https://hadoop.apache.org/docs/current/api/index.html?org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html]
to write one output per segment and fetch list partition, and records written
that way are not counted as reduce output records.
We could set
{code}
MultipleOutputs.setCountersEnabled(job, true);
{code}
which would add one counter for each segment:
{noformat}
$> nutch generate crawldb segments -topN 1000 -numFetchers 3 -maxNumSegments 2
org.apache.hadoop.mapreduce.lib.output.MultipleOutputs
fetchlist-1/part=1000
fetchlist-2/part=399
{noformat}
For reference, the resulting segments/ directory tree (after the partition job):
{noformat}
segments/
|---20240914162841/
|   `-crawl_generate/
|     |-part-r-00000
|     |-part-r-00001
|     `-part-r-00002
`---20240914162906/
    `-crawl_generate/
      |-part-r-00000
      |-part-r-00001
      `-part-r-00002
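To illustrate the mechanism as I understand it, here is a minimal toy model in plain Java, without Hadoop on the classpath. Nothing in it is Hadoop or Nutch code: the class {{CounterSketch}}, the method {{sideWrite}} and the sample URLs are hypothetical stand-ins. It assumes that writes through a side output skip the framework's "Reduce output records" counter and are counted per output path only if counters are enabled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: none of these names are Hadoop or Nutch classes.
class CounterSketch {

    // Stand-in for a write through a side output (the role MultipleOutputs
    // plays in the Generator). The framework counter is never touched here;
    // a per-output counter is incremented only when counters are enabled.
    static void sideWrite(Map<String, Long> counters, String baseOutputPath,
                          String key, boolean countersEnabled) {
        if (countersEnabled) {
            counters.merge(baseOutputPath, 1L, Long::sum);
        }
        // (the record itself would be written to baseOutputPath here)
    }

    // Simulates a reducer that emits five records, all via side outputs.
    static Map<String, Long> run(boolean countersEnabled) {
        Map<String, Long> counters = new LinkedHashMap<>();
        // Only regular reducer output would increment this counter,
        // and the simulated reducer produces none.
        counters.put("Reduce output records", 0L);

        String[] urls = {"a", "b", "c", "d", "e"}; // hypothetical sample keys
        for (int i = 0; i < urls.length; i++) {
            String path = i < 3 ? "fetchlist-1/part" : "fetchlist-2/part";
            sideWrite(counters, path, urls[i], countersEnabled);
        }
        return counters;
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // {Reduce output records=0}
        System.out.println(run(true));  // {Reduce output records=0, fetchlist-1/part=3, fetchlist-2/part=2}
    }
}
```

In this model "Reduce output records" stays at 0 in both runs. Enabling the counters only adds the per-output counts, which matches the counter group shown above.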
Any thoughts or objections? Otherwise, I would open a PR...
> Generator: selector job does not count reduce output records
> ------------------------------------------------------------
>
> Key: NUTCH-3059
> URL: https://issues.apache.org/jira/browse/NUTCH-3059
> Project: Nutch
> Issue Type: Bug
> Components: generator
> Affects Versions: 1.20
> Reporter: Sebastian Nagel
> Priority: Minor
> Fix For: 1.21
>
>
> The selector step (job) of the Generator does not count the reduce output
> records; the counter shows "0":
> {noformat}
> 2024-06-05 13:57:09,299 INFO o.a.n.c.Generator [main] Generator: starting
> 2024-06-05 13:57:09,299 INFO o.a.n.c.Generator [main] Generator: selecting best-scoring urls due for fetch.
> ...
> Map-Reduce Framework
> Map input records=6
> Map output records=6
> ...
> Combine input records=0
> Combine output records=0
> Reduce input groups=1
> Reduce shuffle bytes=594
> Reduce input records=6
> Reduce output records=0
> Spilled Records=12
> ...
> {noformat}
> Not a big issue, but we should investigate why this happens. The other counters
> seem to work properly, and the partitioner job does show the reduce output
> records. The issue is observed in both local and distributed mode.