>nowadays many pages freely mix in markup in the main content area...
Yes, but if that content was nested in a larger block of content, then
it would be included.

I will probably end up implementing some of these algorithms, but I would
like some good feedback before I go out on a limb.


-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: Friday, March 10, 2006 2:51 PM
To: nutch-dev@lucene.apache.org
Subject: Re: quality of search text


Richard Braman wrote:
> Here is a potential algorithm:
>
> Look first to Meta Description, if none exists
> Look for continuous block of text, ignore content that doesn't contain
> a continuous block of text.  If a given html tag only contains a few 
> words of text, it is not content, but rather a part of the nav 
> structure of the page.
>
>   

You may potentially miss a lot of content this way, nowadays many pages 
freely mix in markup in the main content area...
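
For discussion, here is a minimal sketch of the first heuristic in Python,
using only the standard library. It is an illustration, not Nutch code: the
names BlockExtractor and extract_content, the BLOCK_TAGS set, and the
min_words threshold are all assumptions. Inline tags such as <a> or <b> do
not end a block, so markup mixed into running text stays with its paragraph.

```python
from html.parser import HTMLParser

# Hypothetical choice of tags that start or end a text block in this sketch.
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "blockquote", "pre"}
SKIP_TAGS = {"script", "style"}


class BlockExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta_description = None
        self.blocks = []   # finished text blocks, in document order
        self._buf = []     # text pieces of the block being collected
        self._skip = 0     # > 0 while inside <script> or <style>

    def flush(self):
        # Join the collected pieces and squeeze runs of whitespace.
        text = " ".join(" ".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() == "description":
            self.meta_description = (a.get("content") or "").strip() or None
        if tag in SKIP_TAGS:
            self._skip += 1
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self._skip:
            self._buf.append(data)


def extract_content(html, min_words=8):
    """Return the meta description if present, else the long text blocks."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    if parser.meta_description:
        return parser.meta_description
    # Blocks with only a few words are assumed to be navigation.
    return "\n".join(b for b in parser.blocks if len(b.split()) >= min_words)
```

The threshold of 8 words is arbitrary and would need tuning on real pages;
short but genuine content (captions, one-line answers) is dropped by this
rule.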

> Here is yet another algorithm.
>
> When fetching pages from a particular web, analyze the structure of 
> the page, try to make a determination of what content stays similar 
> from page to page within the same web.  That would usually be menus, 
> headers, footers, etc.
>   

This requires collecting pages in advance to train the structure 
recognizer, and preparing "profiles" for groups of pages with common
layout.
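
A rough sketch of what such a "profile" could look like, in Python. This is
only one possible reading of the idea: the function names, the fingerprint
normalization, and the min_fraction threshold are assumptions, and the input
is assumed to be already split into text blocks per page. Collapsing digits
and case before counting is a crude way to match on a pattern rather than on
the literal text.

```python
import re
from collections import Counter


def fingerprint(block):
    """Crude pattern for a block: lowercase, digits collapsed, spaces squeezed."""
    block = re.sub(r"\d+", "#", block.lower())
    return " ".join(block.split())


def build_profile(pages, min_fraction=0.8):
    """Collect fingerprints seen on at least min_fraction of the pages.

    pages: list of pages from the same site, each a list of text blocks.
    """
    counts = Counter()
    for blocks in pages:
        # Use a set so a block repeated within one page counts once.
        counts.update({fingerprint(b) for b in blocks})
    threshold = min_fraction * len(pages)
    return {fp for fp, n in counts.items() if n >= threshold}


def strip_boilerplate(blocks, profile):
    """Keep only the blocks whose fingerprint is not in the site profile."""
    return [b for b in blocks if fingerprint(b) not in profile]
```

For example, with three pages that share a "Home | About" menu and a
"Page N of 3" footer, build_profile puts both in the profile and
strip_boilerplate leaves only the per-page article text.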

> Granted the menus may change slightly from page to page, which is why 
> the algorithm would be pattern based instead of literal. When you 
> determine what is navigation and what is content, you would only parse
> and index the content.
>
> I think algorithm #1 is what Google uses.
> Google ignores content that does not change from page to page, as well
> as content that isn't part of a block of text.
>
> Comments please
>   

The best way to evaluate this would be to ..erhm.. evaluate these 
algorithms on a set of reference pages. Would you like to implement one 
or both algorithms and test them?
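
One simple way to score such a test, sketched in Python: compare the words an
extractor returns against a hand-labeled version of each reference page.
Word-level precision and recall is an assumption here, not a metric anyone in
this thread proposed, and precision_recall is a hypothetical helper name.

```python
def precision_recall(extracted, reference):
    """Word-level precision and recall of extracted text vs. labeled content."""
    ext = set(extracted.lower().split())
    ref = set(reference.lower().split())
    overlap = len(ext & ref)
    # Precision: how much of what was extracted is real content.
    precision = overlap / len(ext) if ext else 0.0
    # Recall: how much of the real content was extracted.
    recall = overlap / len(ref) if ref else 0.0
    return precision, recall
```

Low precision would mean navigation text is leaking into the index; low
recall would mean real content is being thrown away, which is the risk
raised above for the first algorithm.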

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
