Jha,

Nutch doesn't include anything that decides which pages are relevant and 
should be kept for your vertical search, and which should be discarded.  
One could write a custom plugin that does this type of classification, 
though.
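
To make the idea concrete, here is a rough sketch of the kind of relevance 
check such a plugin might perform.  It is plain Python, not Nutch plugin code 
(Nutch plugins are written in Java), and the term list, weights, threshold and 
function names are all made up for illustration.  A real classifier would 
probably be trained on labelled pages rather than use a hand-picked word list.

```python
import re

# Hypothetical term weights for a medical vertical (illustrative only,
# not taken from Nutch or from any real classifier).
MEDICAL_TERMS = {
    "patient": 1.0,
    "clinical": 1.5,
    "diagnosis": 1.5,
    "treatment": 1.0,
    "symptom": 1.0,
    "symptoms": 1.0,
}


def relevance_score(text, term_weights=MEDICAL_TERMS):
    """Return the summed weight of known terms divided by the token count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(term_weights.get(tok, 0.0) for tok in tokens)
    return hits / len(tokens)


def keep_page(text, threshold=0.05):
    """Keep a page only if its score reaches the (assumed) threshold."""
    return relevance_score(text) >= threshold
```

The same score could be applied to anchor text or the text around a link to 
decide whether an outlink is worth fetching.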


Otis 
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: DS jha <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Tuesday, June 17, 2008 2:11:35 PM
> Subject: Re: getting seed list for vertical search engine
> 
> Thanks for your reply. However, the problem with this approach is that
> you have to know the set of websites first, whereas we are using a
> focused crawling approach to build our vertical - the idea being that the
> crawler will be able to determine which outlinks to fetch (or discard).
> 
> Another problem with manually preparing a seed list from the known site
> list is that I am sure to miss lots of small, individual sites - I
> wonder how Google, MSN, and Yahoo do it - they must be getting lists of
> sites from ISPs, hosting providers, etc.?
> 
> Thanks
> Jha,
> 
> 
> 
> 
> On Mon, Jun 16, 2008 at 11:15 PM, Otis Gospodnetic
> wrote:
> > This seems to be a common request - sizing.  I think the best you can do is
> > use existing search engines to estimate how many pages the sites you are
> > interested in have.  You will have to know the exact sites (their URLs) and
> > make use of the "site:" search operator (Google, Yahoo).  Yahoo also has
> > something called SiteExplorer that might help.  Getting the seed list is
> > typically a (semi-)manual process.
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> >
> > ----- Original Message ----
> >> From: DS jha 
> >> To: [email protected]
> >> Sent: Monday, June 16, 2008 11:04:06 PM
> >> Subject: getting seed list for vertical search engine
> >>
> >> Hello,
> >> We are in the process of developing a vertical search engine for the
> >> medical industry – and I need to estimate server/sizing requirements
> >> to setup my environment – my question is, how do I estimate how many
> >> documents I will be fetching for a particular vertical?  And – from
> >> where do I get the seed list of all the sites? Will dmoz health
> >> category be sufficient or will I have to purchase a seed list?
> >>
> >> Thanks
> >
> >
