[jira] [Commented] (NUTCH-3059) Generator: selector job does not count reduce output records

Sebastian Nagel (Jira) Sat, 14 Sep 2024 07:46:05 -0700


    [ https://issues.apache.org/jira/browse/NUTCH-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881791#comment-17881791 ]


Sebastian Nagel commented on NUTCH-3059:
----------------------------------------

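Ok, found the reason: the Generator uses 
[MultipleOutputs|https://hadoop.apache.org/docs/current/api/index.html?org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html]
 to write one output per segment and fetch list partition. Records written this way bypass the reducer's regular output writer, so the framework's "Reduce output records" counter is never incremented.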

We could set
{code}
MultipleOutputs.setCountersEnabled(job, true);
{code}
which would add one counter for each segment:
{noformat}
$> nutch generate crawldb segments -topN 1000 -numFetchers 3 -maxNumSegments 2
        org.apache.hadoop.mapreduce.lib.output.MultipleOutputs
                fetchlist-1/part=1000
                fetchlist-2/part=399
{noformat}
The same run leaves the following segments/ directory tree (after the partition job):
{noformat}
segments/
 |---20240914162841/
 |   `-crawl_generate/
 |     |-part-r-00000
 |     |-part-r-00001
 |     `-part-r-00002
 `---20240914162906/
     `-crawl_generate/
       |-part-r-00000
       |-part-r-00001
       `-part-r-00002
{noformat}
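
Since the built-in "Reduce output records" counter stays at 0, the number of selected records would have to be derived by summing the per-output counters. A minimal sketch of that arithmetic, using the counter lines shown above as input; the class and method names are hypothetical and not part of Nutch or Hadoop, and in a real job the values would presumably be read through Hadoop's counters API rather than parsed from text:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: sums per-output counters taken from a counter dump. */
class MultipleOutputsCounterSum {

  /** Parses "name=value" lines into an ordered map; lines without '=' are skipped. */
  static Map<String, Long> parseCounters(List<String> lines) {
    Map<String, Long> counters = new LinkedHashMap<>();
    for (String line : lines) {
      String trimmed = line.trim();
      int eq = trimmed.lastIndexOf('=');
      if (eq <= 0) {
        continue; // counter group header or blank line
      }
      String name = trimmed.substring(0, eq);
      long value = Long.parseLong(trimmed.substring(eq + 1).trim());
      counters.put(name, value);
    }
    return counters;
  }

  /** Adds up all counter values, i.e. the records written across all outputs. */
  static long total(Map<String, Long> counters) {
    long sum = 0;
    for (long value : counters.values()) {
      sum += value;
    }
    return sum;
  }

  public static void main(String[] args) {
    // Counter lines copied from the generate run above.
    List<String> dump = List.of(
        "        org.apache.hadoop.mapreduce.lib.output.MultipleOutputs",
        "                fetchlist-1/part=1000",
        "                fetchlist-2/part=399");
    Map<String, Long> counters = parseCounters(dump);
    counters.forEach((name, value) -> System.out.println(name + " -> " + value));
    System.out.println("total records written = " + total(counters));
  }
}
```

For the run above this gives 1000 + 399 = 1399 records across the two segments.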

Any thoughts or objections? Otherwise, I would open a PR...

> Generator: selector job does not count reduce output records
> ------------------------------------------------------------
>
>                 Key: NUTCH-3059
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3059
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 1.20
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.21
>
>
> The selector step (job) of the Generator does not count the reduce output 
> records: the counter shows "0":
> {noformat}
> 2024-06-05 13:57:09,299 INFO o.a.n.c.Generator [main] Generator: starting
> 2024-06-05 13:57:09,299 INFO o.a.n.c.Generator [main] Generator: selecting best-scoring urls due for fetch.
> ...
>          Map-Reduce Framework
>                 Map input records=6
>                 Map output records=6
>                 ...
>                 Combine input records=0
>                 Combine output records=0
>                 Reduce input groups=1
>                 Reduce shuffle bytes=594
>                 Reduce input records=6
>                 Reduce output records=0
>                 Spilled Records=12
>                 ...
> {noformat}
> Not a big issue, but we should investigate why this happens. The other 
> counters seem to work properly, and the partitioner job does show the reduce 
> output records. The issue is observed in both local and distributed mode.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
