Thanks for your reply. However, the problem with this approach is that you have to know the set of websites first, whereas we are using a focused crawling approach to build our vertical - the idea being that the crawler will be able to determine which outlinks to fetch (or discard).
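To make the focused-crawling idea concrete, here is a minimal sketch of an outlink filter. Everything in it is a stand-in (the vocabulary, the threshold, the function names are all hypothetical, not the Nutch API): it scores each outlink's anchor text against a small medical vocabulary and keeps only links that look on-topic.

```python
# Hypothetical focused-crawler outlink filter: keep links whose anchor
# text overlaps a topic vocabulary. A real crawler would use a trained
# classifier over page content; this keyword overlap is just a sketch.

MEDICAL_TERMS = {"clinic", "patient", "diagnosis", "medicine", "health",
                 "hospital", "treatment", "symptom", "drug", "therapy"}

def relevance(anchor_text: str) -> float:
    """Fraction of anchor-text words found in the topic vocabulary."""
    words = [w.strip(".,;:!?").lower() for w in anchor_text.split()]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in MEDICAL_TERMS)
    return hits / len(words)

def filter_outlinks(outlinks, threshold=0.2):
    """Keep URLs from (url, anchor_text) pairs that score on-topic."""
    return [url for url, anchor in outlinks if relevance(anchor) >= threshold]

links = [
    ("http://example.org/clinic", "Find a clinic near you"),
    ("http://example.org/sports", "Latest football scores"),
]
print(filter_outlinks(links))  # keeps only the clinic link
```

In a real deployment the scoring function is where the work goes - anchor text alone is weak evidence, and most focused crawlers also score the fetched page itself before following its outlinks.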
Another problem with manually preparing a seed list from a known-site list is that I am sure to miss lots of small, individual sites - I wonder how Google, MSN, and Yahoo do it - they must be getting lists from ISPs, hosting providers, etc.?

Thanks,
Jha

On Mon, Jun 16, 2008 at 11:15 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> This seems to be a common request - sizing. I think the best you can do is
> use existing search engines to estimate how many pages the sites you are
> interested in have. You will have to know the exact sites (their URLs) and
> make use of the "site:" search operator (Google, Yahoo). Yahoo also has
> something called Site Explorer that might help. Getting the seed list is
> typically a (semi-)manual process.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> ----- Original Message ----
>> From: DS jha <[EMAIL PROTECTED]>
>> To: [email protected]
>> Sent: Monday, June 16, 2008 11:04:06 PM
>> Subject: getting seed list for vertical search engine
>>
>> Hello,
>> We are in the process of developing a vertical search engine for the
>> medical industry - and I need to estimate server/sizing requirements
>> to set up my environment - my question is, how do I estimate how many
>> documents I will be fetching for a particular vertical? And - from
>> where do I get the seed list of all the sites? Will the dmoz health
>> category be sufficient, or will I have to purchase a seed list?
>>
>> Thanks
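Since the original question mentions the dmoz health category, here is a rough sketch of pulling seed URLs for Top/Health out of a DMOZ RDF dump. The tag and attribute names (`Topic`, `r:id`, `link r:resource`) follow the published DMOZ content.rdf.u8 format as I understand it, but treat them as assumptions and check them against your copy of the dump; the line-by-line scan is a simplification of real RDF parsing.

```python
# Hedged sketch: extract seed URLs listed under Top/Health topics from
# DMOZ-style RDF text. Assumes one tag per line, which holds for the
# dump's layout but is not guaranteed by RDF in general.
import re

def health_seeds(rdf_text):
    """Collect <link r:resource=...> URLs inside Top/Health topics."""
    seeds = []
    in_health = False
    for line in rdf_text.splitlines():
        if "<Topic" in line:
            # Entering a new topic: note whether it is under Top/Health.
            in_health = 'r:id="Top/Health' in line
        m = re.search(r'<link r:resource="([^"]+)"', line)
        if in_health and m:
            seeds.append(m.group(1))
    return seeds

sample = """\
<Topic r:id="Top/Health/Medicine">
  <link r:resource="http://example-clinic.org/"/>
</Topic>
<Topic r:id="Top/Sports">
  <link r:resource="http://example-sports.com/"/>
</Topic>
"""
print(health_seeds(sample))  # only the Top/Health URL survives
```

Even if dmoz gives you a few hundred thousand health URLs, it will still miss the small individual sites you mention - which is an argument for using it only to bootstrap the focused crawler rather than as the final site list.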
