Dennis Kubes wrote:
In the beginning it is approximately 10 to 1, so for every page I crawl I get about 10 more pages to crawl that are not currently in the index. As you move toward 50 million pages it becomes more like 6 to 1. If you seed the entire dmoz, your first crawl will be around 5.5 million pages, your second crawl around 54 million pages, and a depth of 3 will give you over 300 million pages. These are the numbers we are currently seeing.
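The growth described above can be sketched with a toy model (a hypothetical illustration only, using the rough ratios quoted in this thread, not measured crawl data):

```python
# Rough model of crawl frontier growth: each depth fetches the URLs
# discovered at the previous depth, multiplied by an out-link
# "discovery ratio" that shrinks as the crawl covers more of the web.

def frontier_growth(seed_pages, ratios):
    """Return approximate pages fetched at each successive depth."""
    sizes = [seed_pages]
    for r in ratios:
        sizes.append(sizes[-1] * r)
    return sizes

# Depth 1 starts from the full dmoz seed (~5.5M URLs); the ratio
# drops from roughly 10:1 early on toward 6:1 at larger scales.
sizes = frontier_growth(5_500_000, [10, 6])
print(sizes)  # -> [5500000, 55000000, 330000000]
```

This reproduces the same order of magnitude as the numbers above: ~5.5M at depth 1, ~55M at depth 2, and over 300M at depth 3.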
Be advised, though, that any crawl run that collects more than 1 million pages is bound to collect a LOT of utter junk and spam - unless you tightly control the quality of URLs using URLFilters, ScoringFilters, and other means.
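For the URLFilter part, Nutch's stock regex-urlfilter plugin reads a list of `+`/`-` prefixed Java regexes (first match wins) from conf/regex-urlfilter.txt. A minimal sketch of the kind of rules that cut out common junk (the specific patterns here are illustrative, not a recommended production config):

```
# Skip URLs with common junk indicators: session IDs, calendars,
# and obvious spam TLD patterns (illustrative examples only).
-[?&](sid|sessionid|PHPSESSID)=
-(calendar|login|logout)
# Skip non-page file types we don't want to fetch.
-\.(gif|jpg|png|css|js|zip|exe)$
# Accept everything else.
+.
```

ScoringFilters work on the other side of the problem: rather than excluding URLs outright, they adjust the priority a URL gets in generate/fetch cycles, so higher-quality pages are crawled first.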
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
