> it doesn't say in pages "this is menu, this is body text",

Agreed, it doesn't say that.

> this is definitely NOT trivial

This isn't trivial, but it is rather important.

> it's hard to come up with a method that works for any layout.

Here is a potential algorithm:

Look first to the meta description. If none exists, look for a continuous block of text, and ignore content that doesn't contain a continuous block of text. If a given HTML tag contains only a few words of text, it is not content, but rather a part of the navigation structure of the page.

Here is yet another algorithm:

When fetching pages from a particular web site, analyze the structure of the page and try to determine what content stays similar from page to page within the same site. That would usually be menus, headers, footers, etc. Granted, the menus may change slightly from page to page, which is why the algorithm would be pattern based instead of literal. Once you determine what is navigation and what is content, you would only parse and index the content.

I think algorithm #1 is what Google uses. Google ignores content that does not change from page to page, as well as content that isn't part of a block of text.

Comments please.

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Friday, March 10, 2006 1:57 PM
To: nutch-dev@lucene.apache.org
Subject: Re: quality of search text

Richard Braman wrote:
> I too have noticed menu text appearing in the search results.
>

The proper place to fix it would be in parse-html, perhaps in DOMContentUtils. However, be warned that this is definitely NOT trivial - i.e. it doesn't say in pages "this is menu, this is body text", you have to figure it out, and it's hard to come up with a method that works for any layout. You may hardcode something that works well for your target group of hosts, with pre-determined page layouts.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
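
To make the two heuristics from the reply above a bit more concrete, here is a rough Python sketch. It is not Nutch code and not what parse-html or DOMContentUtils does. All names in it (`BlockCollector`, `parse_page`, `extract_content`, `drop_repeated_blocks`), the list of block tags, and the thresholds `min_words=8` and `max_share=0.5` are assumptions made up for illustration. Note also that the second function matches repeated blocks literally, which is a simplification of the pattern-based matching suggested above.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: these tags delimit a "block" of text; the list is illustrative.
BLOCK_TAGS = {"p", "div", "td", "th", "li", "ul", "ol", "table", "tr", "br",
              "title", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}  # text inside these is never content


class BlockCollector(HTMLParser):
    """Collects the meta description and the text found between block tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.description = None
        self.blocks = []
        self._buf = []
        self._skip_depth = 0

    def end_block(self):
        # Normalize whitespace and store the buffered text if any is left.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            if (attr.get("name") or "").lower() == "description" and attr.get("content"):
                self.description = attr["content"].strip()
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        if tag in BLOCK_TAGS:
            self.end_block()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        if tag in BLOCK_TAGS:
            self.end_block()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buf.append(data)


def parse_page(html):
    """Return (meta description or None, list of text blocks) for one page."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.end_block()  # keep any trailing text after the last block tag
    return collector.description, collector.blocks


def extract_content(html, min_words=8):
    """Heuristic 1: prefer the meta description, else keep only long blocks."""
    description, blocks = parse_page(html)
    if description:
        return description
    return "\n".join(b for b in blocks if len(b.split()) >= min_words)


def drop_repeated_blocks(pages, max_share=0.5):
    """Heuristic 2 (literal variant): drop blocks shared by many pages.

    `pages` is a list of block lists, one per page of the same site. A block
    that appears on more than `max_share` of the pages is treated as
    navigation (menu, header, footer) and removed.
    """
    seen = Counter()
    for blocks in pages:
        seen.update(set(blocks))
    limit = max_share * len(pages)
    return [[b for b in blocks if seen[b] <= limit] for blocks in pages]
```

The word-count test stands in for "only contains a few words of text", and the page-share test stands in for "stays similar from page to page". Both thresholds would need tuning per site, which matches the warning that no single method works for every layout.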