Thanks for your reply. However, the problem with this approach is that
you have to know the set of websites first, whereas we are using a
focused crawling approach to build our vertical - the idea being that
the crawler will be able to determine which outlinks to fetch (or discard).
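To make that concrete, here is a minimal sketch of the kind of outlink
filter we have in mind (the keyword list and threshold are hypothetical
placeholders - a real focused crawler would use a trained topical
classifier, and this is not any actual Nutch plugin API):

```python
import re

# Hypothetical relevance terms for a medical vertical; in practice this
# would be replaced by a classifier trained on labeled pages.
MEDICAL_TERMS = {"health", "medical", "clinic", "disease", "patient", "drug"}

def score_outlink(url, anchor_text):
    """Score an outlink by how many medical terms appear in its URL
    and anchor text."""
    tokens = set(re.findall(r"[a-z]+", (url + " " + anchor_text).lower()))
    return len(tokens & MEDICAL_TERMS)

def should_fetch(url, anchor_text, threshold=1):
    """Fetch the outlink only if its relevance score meets the threshold."""
    return score_outlink(url, anchor_text) >= threshold
```

With something like this, the crawler keeps links such as
"/health/drug-info" and discards unrelated ones, so the frontier stays
inside the vertical without a fixed site list.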

Another problem with manually preparing a seed list from a known site
list is that I am sure to miss lots of small, individual sites - I
wonder how Google, MSN, and Yahoo do it - they must be getting lists
from ISPs, hosting providers, etc.?
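For what it's worth, one way to bootstrap the seed list without a manual
site inventory is to pull URLs for just the Health branch out of the
DMOZ RDF dump. A rough sketch (the <Topic>/<link> element layout is
assumed from the content.rdf.u8 format - verify against the actual dump):

```python
import re

# Patterns for Topic headers and link entries in a DMOZ content dump;
# <link1> is matched as well as <link>, since the dump uses both.
TOPIC_RE = re.compile(r'<Topic r:id="([^"]+)"')
LINK_RE = re.compile(r'<link1? r:resource="([^"]+)"')

def health_seeds(lines):
    """Yield seed URLs listed under Top/Health topics, reading the
    dump line by line so it never has to fit in memory."""
    in_health = False
    for line in lines:
        m = TOPIC_RE.search(line)
        if m:
            # Entering a new topic: remember whether it is under Health.
            in_health = m.group(1).startswith("Top/Health")
        elif in_health:
            m = LINK_RE.search(line)
            if m:
                yield m.group(1)
```

That would at least capture the small individual sites that DMOZ editors
have already catalogued, even if it misses sites nobody has listed yet.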

Thanks
Jha,




On Mon, Jun 16, 2008 at 11:15 PM, Otis Gospodnetic
<[EMAIL PROTECTED]> wrote:
> This seems to be a common request - sizing.  I think the best you can do is 
> use existing search engines to estimate how many pages sites you are 
> interested in have.  You will have to know the exact sites (their URLs) and 
> make use of the "site:" search operator (Google, Yahoo).  Yahoo also has 
> something called SiteExplorer that might help.  Getting the seed list is 
> typically a (semi-)manual process.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> ----- Original Message ----
>> From: DS jha <[EMAIL PROTECTED]>
>> To: [email protected]
>> Sent: Monday, June 16, 2008 11:04:06 PM
>> Subject: getting seed list for vertical search engine
>>
>> Hello,
>> We are in the process of developing a vertical search engine for the
>> medical industry – and I need to estimate server/sizing requirements
>> to setup my environment – my question is, how do I estimate how many
>> documents I will be fetching for a particular vertical?  And – from
>> where do I get the seed list of all the sites? Will dmoz health
>> category be sufficient or will I have to purchase a seed list?
>>
>> Thanks
>
>
