Sizing seems to be a common request. I think the best you can do is use existing search engines to estimate how many pages the sites you are interested in have. You will need to know the exact sites (their URLs) and use the "site:" search operator (Google, Yahoo) on each one; treat the result counts as rough estimates. Yahoo also has a tool called Site Explorer that might help. Getting the seed list is typically a (semi-)manual process.
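To make the arithmetic concrete, here is a minimal sketch of turning per-site page counts into a rough sizing figure. The site names, the counts, the 50 KB average page size, and the 1.5x overhead factor are all made-up assumptions for illustration; plug in the numbers you collect by hand from "site:" queries and from sampling real pages.

```python
def estimate_corpus(page_counts, avg_page_kb=50.0, overhead=1.5):
    """Return (total_pages, storage_gb) for a dict of site -> estimated page count.

    avg_page_kb and overhead are assumed values, not measured ones.
    """
    total_pages = sum(page_counts.values())
    # KB -> GB, scaled by an assumed factor for index and crawl metadata
    storage_gb = total_pages * avg_page_kb * overhead / (1024 * 1024)
    return total_pages, storage_gb


# Hypothetical counts, as read off "site:" result pages by hand
counts = {
    "example-health.org": 120_000,
    "example-clinic.com": 30_000,
}

total, gb = estimate_corpus(counts)
print(f"{total} pages, roughly {gb:.1f} GB")
```

The total is only as good as the per-site estimates, so it is probably best read as an order-of-magnitude figure.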
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: DS jha <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Monday, June 16, 2008 11:04:06 PM
> Subject: getting seed list for vertical search engine
>
> Hello,
> We are in the process of developing a vertical search engine for the
> medical industry – and I need to estimate server/sizing requirements
> to setup my environment – my question is, how do I estimate how many
> documents I will be fetching for a particular vertical? And – from
> where do I get the seed list of all the sites? Will dmoz health
> category be sufficient or will I have to purchase a seed list?
>
> Thanks
