>it doesn't say in pages "this is menu, this is body text",
Agreed, it doesn't say that.

>this is definitely NOT trivial
It isn't trivial, but it is rather important.

>it's hard to come up with a method that works for any layout. 

Here is a potential algorithm:

First look at the meta description. If none exists, look for a
continuous block of text and ignore anything that doesn't contain one.
If a given HTML tag contains only a few words of text, it is not
content, but rather part of the navigation structure of the page.
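
To make the first idea concrete, here is a rough sketch in Python using
only the standard library. It is not Nutch code and not a patch for
parse-html. The names BlockExtractor and extract_content, the list of
inline tags, and the 10-word threshold are all my own assumptions, picked
only for illustration.

```python
from html.parser import HTMLParser

# Text inside these tags is never page content.
SKIP_TAGS = {"script", "style"}
# Assumption: these inline tags do not break a block of text; any other tag does.
INLINE_TAGS = {"a", "b", "i", "em", "strong", "span", "u", "code"}


class BlockExtractor(HTMLParser):
    """Collects the meta description and the text blocks of one page."""

    def __init__(self):
        super().__init__()
        self.description = None
        self.blocks = []
        self._buffer = []
        self._skip_depth = 0

    def _flush(self):
        # Close the current block, collapsing runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = (attrs.get("content") or "").strip() or None
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        if tag not in INLINE_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        if tag not in INLINE_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buffer.append(data)


def extract_content(html, min_words=10):
    """Return the meta description if present, else the long text blocks."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep any text that follows the last tag
    if parser.description:
        return parser.description
    # Hypothetical threshold: shorter blocks are treated as navigation.
    return "\n".join(b for b in parser.blocks if len(b.split()) >= min_words)
```

With this, a menu built from short list items is dropped and a paragraph
of ordinary prose is kept. The threshold would need tuning, and short but
legitimate content (headings, captions) would be lost too.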

Here is another algorithm:

When fetching pages from a particular web site, analyze the structure of
the pages and try to determine what content stays similar from page to
page within that site. That would usually be menus, headers, footers,
etc. Granted, the menus may change slightly from page to page, which is
why the algorithm would be pattern based instead of literal. Once you
have determined what is navigation and what is content, you would parse
and index only the content.
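
A rough sketch of the second idea, again in plain Python and not Nutch
code. It assumes each page has already been split into text blocks. The
names normalize, find_boilerplate and content_blocks, and the 60% share,
are made up for illustration. The "pattern" here is only a crude
stand-in: case, spacing and digits are ignored when comparing blocks.

```python
import re
from collections import Counter


def normalize(block):
    """Crude pattern: ignore case and spacing, replace digit runs with '#'."""
    return re.sub(r"\d+", "#", " ".join(block.lower().split()))


def find_boilerplate(pages, min_share=0.6):
    """Return normalized blocks seen on at least min_share of the pages."""
    counts = Counter()
    for blocks in pages:
        # Use a set so a block repeated within one page counts only once.
        counts.update({normalize(b) for b in blocks})
    return {b for b, n in counts.items() if n >= min_share * len(pages)}


def content_blocks(blocks, boilerplate):
    """Keep only the blocks that are not site-wide boilerplate."""
    return [b for b in blocks if normalize(b) not in boilerplate]


# Made-up sample: three pages from one site, each a list of text blocks.
pages = [
    ["Home | About | Contact", "Page 1 of 3", "First article body.", "Copyright 2006"],
    ["Home | About | Contact", "Page 2 of 3", "Second article body.", "Copyright 2006"],
    ["Home | About | Contact", "Page 3 of 3", "Third article body.", "Copyright 2006"],
]
boilerplate = find_boilerplate(pages)
indexable = [content_blocks(blocks, boilerplate) for blocks in pages]
```

Only the article bodies remain in indexable. Note that "Page 1 of 3" and
"Page 2 of 3" differ literally but match once digits are ignored, which
is the point of comparing patterns rather than exact strings. A real
version would need enough pages per site before the counts mean anything.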

I think algorithm #1 is what Google uses.
Google ignores content that does not change from page to page, as well
as content that isn't part of a block of text.

Comments please

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: Friday, March 10, 2006 1:57 PM
To: nutch-dev@lucene.apache.org
Subject: Re: quality of search text


Richard Braman wrote:
> I too have noticed menu text appearing in the search results.
>   

The proper place to fix it would be in parse-html, perhaps in 
DOMContentUtils.

However, be warned that this is definitely NOT trivial - i.e. it doesn't
say in pages "this is menu, this is body text", you have to figure it 
out, and it's hard to come up with a method that works for any layout. 
You may hardcode something that works well for your target group of 
hosts, with pre-determined page layouts.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
