In addition to the two popular approaches, 1) crawling a predetermined
list of web sites, and 2) generic crawling augmented with classifying
web pages into topics or domains, there is a third approach: 3) focused
crawling.

Let's say you want to crawl 20 million pages. Coming up with the list
by hand would be impractical. Crawling the web in general and then
using a classifier to pick the 20 million that match your target
subject may also be impractical, especially if only a small percentage
(e.g., 5%) of all web pages falls into your domain. A focused crawler
starts with a seed set of pages (say 100-1000 pages) that are manually
collected. From this seed set, the crawler extracts URLs and downloads
pages. However, a page is downloaded only if there is a good chance
that it will be relevant to the subject. A naïve Bayes classifier, or
another classifier, can be used to make the prediction. This approach
is described in more detail in:

http://www2006.org/programme/item.php?id=4512
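
To make the idea concrete, here is a rough Python sketch of the crawl
loop. It is only an illustration under my own assumptions, not the
method from the paper above: the names NaiveBayes, focused_crawl,
fetch, threshold and max_pages are made up, fetch(url) is a
hypothetical function you would supply that returns the page text plus
its (link URL, anchor text) pairs, and relevance is guessed from the
anchor text alone using a tiny naive Bayes model trained on a few
example strings.

```python
import math
from collections import Counter, deque


class NaiveBayes:
    """Tiny two-class multinomial naive Bayes with add-one smoothing.

    Assumes both example lists are non-empty.
    """

    def __init__(self, relevant_texts, other_texts):
        self.counts = {True: Counter(), False: Counter()}
        self.docs = {True: len(relevant_texts), False: len(other_texts)}
        for label, texts in ((True, relevant_texts), (False, other_texts)):
            for text in texts:
                self.counts[label].update(text.lower().split())
        self.vocab = set(self.counts[True]) | set(self.counts[False])

    def _log_score(self, label, words):
        # log P(label) + sum of log P(word | label), skipping unknown words.
        total = sum(self.counts[label].values()) + len(self.vocab)
        score = math.log(self.docs[label] / (self.docs[True] + self.docs[False]))
        for word in words:
            if word in self.vocab:
                score += math.log((self.counts[label][word] + 1) / total)
        return score

    def prob_relevant(self, text):
        words = text.lower().split()
        a = self._log_score(True, words)
        b = self._log_score(False, words)
        m = max(a, b)  # subtract the max before exp() for numerical stability
        ea, eb = math.exp(a - m), math.exp(b - m)
        return ea / (ea + eb)


def focused_crawl(seeds, fetch, classifier, threshold=0.5, max_pages=100):
    """Download seeds, then only links whose anchor text looks relevant.

    fetch(url) is a caller-supplied (hypothetical) function returning
    (page_text, [(link_url, anchor_text), ...]).
    """
    queue = deque(seeds)
    seen = set(seeds)
    downloaded = []
    while queue and len(downloaded) < max_pages:
        url = queue.popleft()
        _text, links = fetch(url)
        downloaded.append(url)
        for link_url, anchor in links:
            if link_url in seen:
                continue
            # Each URL is judged once, on the first anchor text seen for it.
            seen.add(link_url)
            if classifier.prob_relevant(anchor) >= threshold:
                queue.append(link_url)
    return downloaded
```

With a threshold of 0.5, a link is followed only when its anchor text
looks at least as much like the relevant examples as the other ones. A
real crawler would probably also score the text around the link and
keep the queue ordered by score rather than first-in, first-out.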

Cheers,

Tony A.A.

On 1/30/07, Dennis Kubes wrote:
It means searching a specific domain such as automotive, health, etc.

How to do it is another story. The short answer is that you could either
index only specific sites that you know are in the domain, or you could
create ways to determine automatically whether a page is in a domain.

Dennis Kubes

Reddeppa Naidu wrote:
> Hi,
> I am new to Nutch search; I have been working with it for the past month.
> Can anyone tell me what is meant by vertical search? Can anyone suggest
> how I can do it?
>
> Thanks
> pandu
>
