On Aug 19, 2009, at 8:42 PM, alx...@aim.com wrote:


Thanks. What if the URLs in my seed file do not have outlinks, let's say .pdf files? Should I still specify the topN variable? All I need is to index all URLs in my seed file, and there are about 1 M of them.

topN means that each generated shard (segment) contains at most N of the most popular URLs from your crawldb that have not been fetched yet.
"Popular" means the URLs with the highest score.

You can set topN to "-1". If you do this, you generate and fetch all URLs in one shard.
If you set topN to 330,000, you fetch at most 330,000 URLs in one shard.
If you specify the depth parameter, you generate that many shards, one per generate/fetch round.

For example, with -topN 330000 -depth 3
you generate/fetch/parse/index 3 shards. Every shard contains at most 330,000 URLs, so at most ~990,000 URLs in total.
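
To make the arithmetic concrete for a seed list of about 1 M URLs without outlinks, here is a rough Python sketch. It is not Nutch code: plan_shards is a hypothetical helper that only models the rule described above, and it assumes that every fetch succeeds, that no new URLs are discovered between rounds, and that a negative topN means "no limit".

```python
def plan_shards(unfetched, top_n, depth):
    """Model how many URLs each shard would hold.

    unfetched: number of unfetched URLs in the crawldb (assumed fixed,
               i.e. the fetched pages add no new outlinks)
    top_n:     maximum URLs per shard; a negative value means no limit
    depth:     number of generate/fetch rounds, one shard per round
    """
    shards = []
    remaining = unfetched
    for _ in range(depth):
        if remaining == 0:
            break  # nothing left to generate
        size = remaining if top_n < 0 else min(top_n, remaining)
        shards.append(size)
        remaining -= size
    return shards


if __name__ == "__main__":
    seeds = 1_000_000
    print(plan_shards(seeds, 330_000, 3))  # [330000, 330000, 330000]
    print(plan_shards(seeds, 330_000, 4))  # [330000, 330000, 330000, 10000]
    print(plan_shards(seeds, -1, 1))       # [1000000]
```

Under these assumptions, topN 330000 with depth 3 leaves about 10,000 of the 1 M seed URLs unfetched, so you would need a fourth round, a larger topN, or no limit at all to cover the whole seed file.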

