Dennis Kubes wrote:
In the beginning it is approximately 10 to 1. So for every page I crawl I will get 10 more pages to crawl that are not yet in the index. As you move towards 50 million pages it becomes more like 6 to 1. If you seed with the entire DMOZ, your first crawl will be around 5.5 million pages, your second crawl around 54 million pages, and a depth of 3 will give you over 300 million pages. These are the numbers we are currently seeing.
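The compounding above can be sketched as a toy model. This is illustrative only, assuming the quoted ratios (10:1 early on, tapering to 6:1 once the crawl passes ~50 million pages); real growth depends entirely on the seed list and link structure.

```python
def frontier_growth(seed_pages, depth):
    """Rough model of cumulative crawl size per the ratios quoted above.

    Each round fetches the URLs discovered in the previous round,
    multiplied by the current branching ratio (10:1 early, 6:1 late).
    """
    fetched = seed_pages   # pages fetched in the current round
    total = seed_pages     # cumulative pages crawled so far
    for _ in range(depth - 1):
        ratio = 10 if total < 50_000_000 else 6
        fetched *= ratio   # newly discovered URLs become the next round
        total += fetched
    return total

# Seeding with ~5.5M DMOZ pages at depth 3 lands over 300 million pages,
# consistent with the figures above.
print(frontier_growth(5_500_000, 3))
```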

Be advised, though, that any crawl run that collects more than 1 million pages is bound to collect a LOT of utter junk and spam - unless you tightly control the quality of URLs, using URLFilters, ScoringFilters and other means.
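As a concrete example of URL filtering, Nutch's regex-based URLFilter reads patterns from conf/regex-urlfilter.txt: each line starts with `+` (accept) or `-` (reject), and the first matching pattern wins. A minimal sketch (the spam patterns here are made-up examples, not Nutch defaults):

```
# skip file:, ftp:, and mailto: URLs
-^(file|ftp|mailto):

# example: reject session-id query strings, a common source of duplicate junk
-(\?|&)(sid|sessionid|PHPSESSID)=

# accept anything else
+.
```

Tightening the final catch-all (e.g. accepting only whitelisted domains) is the simplest way to keep a large crawl from drowning in spam.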


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com