In the tutorial on the wiki the depth is not specified and topN=1000. I ran
those commands yesterday and they are still running. Will it index all my URLs?
My seed file has about 20K URLs.

Thanks.
Alex.

-----Original Message-----
From: Marko Bauhardt <m...@101tec.com>
To: nutch-user@lucene.apache.org
Sent: Thu, Aug 20, 2009 12:17 am
Subject: Re: topN value in crawl

On Aug 19, 2009, at 8:42 PM, alx...@aim.com wrote:

hi

> Thanks. What if urls in my seed file do not have outlinks, let say .pdf
> files. Should I still specify topN variable? All I need is to index all
> urls in my seed file. And they are about 1 M.

topN means that each generated shard (segment) contains at most N of the most
popular URLs from your crawldb that have not yet been fetched. "Popular" URLs
are the ones with the highest score.
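
As a rough illustration of that selection rule (a sketch only, not Nutch's
actual generator code): the crawldb layout, the function name
generate_segment, and the example URLs below are all made up for the example.

```python
import heapq

def generate_segment(crawldb, top_n):
    """Pick up to top_n unfetched URLs with the highest score.

    crawldb maps url -> (score, fetched). This layout is an assumption
    made for illustration, not Nutch's real CrawlDb format.
    """
    unfetched = [(score, url)
                 for url, (score, fetched) in crawldb.items()
                 if not fetched]
    if top_n < 0:
        # Mirrors the "-1 means everything in one shard" convention
        # described in this thread.
        top_n = len(unfetched)
    best = heapq.nlargest(top_n, unfetched)
    return [url for _score, url in best]

# Hypothetical crawldb contents.
crawldb = {
    "http://example.com/a.pdf": (1.0, False),
    "http://example.com/b.pdf": (0.5, False),
    "http://example.com/c.pdf": (2.0, True),   # already fetched, never selected
    "http://example.com/d.pdf": (0.8, False),
}

print(generate_segment(crawldb, 2))
# ['http://example.com/a.pdf', 'http://example.com/d.pdf']
```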

You can set topN to "-1". If you do this, you generate and fetch all URLs in
one shard. If you set topN=330.000, you fetch at most 330.000 URLs in one
shard. If you specify the depth parameter, you generate that many shards.

For example: -topN=330.000 -depth=3

Then you generate/fetch/parse/index 3 shards, and every shard contains max.
330.000 URLs, so about 990.000 URLs in total.
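
The arithmetic can be checked with a small sketch. It assumes, as in the
question, that the seed URLs have no outlinks, so each round simply takes the
next topN unfetched URLs. The function names are hypothetical, and the 1 M
figure is the one from the quoted question.

```python
import math

def max_urls_fetched(top_n, depth):
    """Upper bound on URLs fetched: one shard of at most top_n per round."""
    return top_n * depth

def rounds_needed(total_urls, top_n):
    """Rounds (depth) needed to cover total_urls when no new links appear."""
    return math.ceil(total_urls / top_n)

print(max_urls_fetched(330_000, 3))        # 990000
print(rounds_needed(1_000_000, 330_000))   # 4
```

Under these assumptions, topN=330.000 with depth=3 stops just short of 1 M
seed URLs, so either one more round or a larger topN would be needed to cover
them all.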


Marko