[ https://issues.apache.org/jira/browse/NUTCH-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107369#comment-13107369 ]
Robert Thomson commented on NUTCH-1074:
---------------------------------------

As far as I can tell, when generator.max.count is set, the Generator.Selector reduce function partitions records so that each segment contains at most that many entries per host. The relative sizes of the resulting segments therefore depend on the distribution of hosts in the crawldb, and topN only limits the mean segment size. If generator.max.count is not set, each segment will contain topN records.

> topN is ignored with maxNumSegments
> -----------------------------------
>
>                 Key: NUTCH-1074
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1074
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>             Fix For: 1.4
>
>
> When generating segments with topN and maxNumSegments, topN is not respected.
> It looks like the first generated segment contains topN * maxNumSegments
> URLs; at least, the number of map input records roughly matches.
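
To illustrate the behaviour described in the comment, here is a rough Python model of it. This is only a sketch under stated assumptions, not Nutch's actual Generator.Selector code: the function name assign_segments, its parameters, and the rule "segment index = entries already taken for this host, divided by the per-host limit" are all hypothetical simplifications chosen to reproduce the effect described.

```python
from collections import defaultdict


def assign_segments(urls, top_n, max_num_segments, max_count=None):
    """Toy model of segment assignment (hypothetical, not Nutch code).

    urls: list of (host, url) pairs, assumed already sorted by score.
    max_count: assumed stand-in for the per-host limit; None means unset.
    """
    limit = top_n * max_num_segments  # overall cap on selected records
    segments = [[] for _ in range(max_num_segments)]
    per_host = defaultdict(int)  # records selected so far for each host
    selected = 0

    for host, url in urls:
        if selected >= limit:
            break
        if max_count is None:
            # No per-host limit: fill segments in order, topN records each.
            seg = selected // top_n
        else:
            # Per-host limit: a host spills into the next segment once it
            # has max_count entries in the current one.
            seg = per_host[host] // max_count
            if seg >= max_num_segments:
                continue  # this host is full in every segment
            per_host[host] += 1
        segments[seg].append(url)
        selected += 1
    return segments


# Six URLs on host "a" and two on host "b".
urls = [("a", f"a{i}") for i in range(6)] + [("b", f"b{i}") for i in range(2)]

with_limit = assign_segments(urls, top_n=4, max_num_segments=2, max_count=3)
without_limit = assign_segments(urls, top_n=4, max_num_segments=2)

print([len(s) for s in with_limit])     # [5, 3]
print([len(s) for s in without_limit])  # [4, 4]
```

In this toy model, with the per-host limit set, the segment sizes are 5 and 3: neither equals topN, but their mean does. Without the limit, each segment holds exactly topN records.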