[ 
https://issues.apache.org/jira/browse/NUTCH-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107369#comment-13107369
 ] 

Robert Thomson commented on NUTCH-1074:
---------------------------------------

As far as I can tell, when generator.max.count is set, the Generator.Selector 
reduce function partitions records so that each segment contains at most 
generator.max.count entries per host.  The relative sizes of the resulting 
segments therefore depend on the distribution of hosts in the crawldb, and 
topN only limits the mean segment size.

If generator.max.count is not set, each segment will contain topN records.
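
To illustrate the behaviour described above, here is a minimal sketch in 
Python.  It is a simplified model of my reading of the reducer, not the actual 
Nutch code: the function name assign_segments, its parameters, and the sample 
hosts are all hypothetical, and it assumes the input is already sorted by 
descending score.

```python
from collections import defaultdict

def assign_segments(urls, top_n, max_num_segments, max_count=None):
    """Toy model of segment assignment (NOT the real Generator.Selector).

    urls: list of (host, url) pairs, assumed sorted by descending score.
    Returns a dict mapping segment number (1-based) to a list of urls.
    """
    segments = defaultdict(list)
    per_host = defaultdict(int)          # urls selected so far, per host
    limit = top_n * max_num_segments     # overall cap across all segments
    selected = 0
    for host, url in urls:
        if selected >= limit:
            break
        if max_count is None:
            # No per-host cap: fill segments one after another, topN each.
            seg = selected // top_n + 1
        else:
            # Per-host cap: each host puts max_count urls in segment 1,
            # then the next max_count in segment 2, and so on.
            seg = per_host[host] // max_count + 1
            if seg > max_num_segments:
                continue                 # host has used its quota everywhere
            per_host[host] += 1
        segments[seg].append(url)
        selected += 1
    return dict(segments)

# Hypothetical crawldb: host "a" has six urls, host "b" has two.
urls = [("a", f"a{i}") for i in range(1, 7)] + [("b", "b1"), ("b", "b2")]

capped = assign_segments(urls, top_n=4, max_num_segments=2, max_count=3)
print({k: len(v) for k, v in capped.items()})    # {1: 5, 2: 3}

uncapped = assign_segments(urls, top_n=4, max_num_segments=2)
print({k: len(v) for k, v in uncapped.items()})  # {1: 4, 2: 4}
```

With the per-host cap, the two segments hold 5 and 3 records (mean 4 = topN); 
without it, each segment holds exactly topN records.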

> topN is ignored with maxNumSegments
> -----------------------------------
>
>                 Key: NUTCH-1074
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1074
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>             Fix For: 1.4
>
>
> When generating segments with topN and maxNumSegments, topN is not respected. 
> It looks like the first generated segment contains topN * maxNumSegments 
> URLs; at least, the number of map input records roughly matches.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
