> nowadays many pages freely mix in markup in the main content area...

Yes, but if that content were nested in a larger block of content, then it would be included.
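A minimal sketch of the first algorithm quoted below, under one possible reading of the point above: text inside inline markup (links, bold, and so on) is counted toward the enclosing block, so a paragraph that mixes in markup is still kept, while a tag holding only a few words on its own is dropped. The names `BLOCK_TAGS`, `MIN_WORDS`, `BlockTextExtractor`, and `extract_content` are hypothetical, and the word threshold is an untested guess. It uses only Python's standard `html.parser` and is not Nutch code.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit "blocks"; any other tag (a, b, span, ...)
# is treated as inline markup belonging to the enclosing block.
BLOCK_TAGS = {"p", "div", "td", "li", "blockquote", "article", "section"}
MIN_WORDS = 8  # hypothetical threshold; would need tuning on real pages


class BlockTextExtractor(HTMLParser):
    """Collects the meta description and the text of each block."""

    def __init__(self):
        super().__init__()
        self.meta_description = None
        self.blocks = []      # finished text blocks, in document order
        self._current = []    # text fragments of the block being read
        self._skip_depth = 0  # greater than 0 while inside <script>/<style>

    def flush(self):
        # Join fragments and normalize whitespace before storing the block.
        text = " ".join("".join(self._current).split())
        if text:
            self.blocks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.meta_description = (attrs.get("content") or "").strip() or None
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()  # a new block starts: close the previous one

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._current.append(data)


def extract_content(html, min_words=MIN_WORDS):
    """Return the meta description if present, else the long text blocks."""
    parser = BlockTextExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    if parser.meta_description:
        return parser.meta_description
    return "\n".join(b for b in parser.blocks if len(b.split()) >= min_words)
```

Calling `extract_content(page_html)` on a page without a meta description returns only the blocks with at least `min_words` words. A short block-level element nested inside a longer one would still be split off and dropped here, so this covers inline markup only.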
I will probably end up implementing some of these algorithms, but I would like some good feedback before I go out on a limb.

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Friday, March 10, 2006 2:51 PM
To: nutch-dev@lucene.apache.org
Subject: Re: quality of search text

Richard Braman wrote:
> Here is a potential algorithm:
>
> Look first to Meta Description, if none exists
> Look for continuous block of text, ignore content that doesn't contain
> a continuous block of text. If a given HTML tag only contains a few
> words of text, it is not content, but rather a part of the nav
> structure of the page.
>

You may potentially miss a lot of content this way, nowadays many pages freely mix in markup in the main content area...

> Here is yet another algorithm.
>
> When fetching pages from a particular web, analyze the structure of
> the page, try to make a determination of what content stays similar
> from page to page within the same web. That would usually be menus,
> headers, footers, etc.
>

This requires collecting pages in advance to train the structure recognizer, and preparing "profiles" for groups of pages with common layout.

> Granted the menus may change slightly from page to page, which is why
> the algorithm would be pattern based instead of literal. When you
> determine what is navigation and what is content, you would only parse
> and index the content.
>
> I think algorithm # 1 is what Google uses.
> Google ignores content that does not change from page to page, as well
> as content that isn't part of a block of text.
>
> Comments please
>

The best way to evaluate this would be to ..erhm.. evaluate these algorithms on a set of reference pages. Would you like to implement one or both algorithms and test them?

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
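
The second algorithm quoted above (finding content that stays similar from page to page within one site) could be sketched roughly as follows. This is only an illustration under stated assumptions: each page has already been split into text blocks, "pattern based" is approximated by a crude normalization that lowercases text and masks digit runs, and a block counts as navigation when it appears on at least a given fraction of the collected pages. The names `normalize`, `find_boilerplate`, `strip_boilerplate`, and the `min_fraction` default are hypothetical.

```python
import re
from collections import Counter


def normalize(block):
    # Crude stand-in for a "pattern": lowercase, collapse whitespace,
    # and mask digit runs so "Page 1 of 3" and "Page 2 of 3" match.
    return re.sub(r"\d+", "#", " ".join(block.lower().split()))


def find_boilerplate(pages, min_fraction=0.8):
    """pages: list of pages from one site, each a list of text blocks.

    Returns the normalized blocks seen on at least min_fraction of pages.
    """
    counts = Counter()
    for blocks in pages:
        # Use a set so a block repeated within one page counts once.
        counts.update({normalize(b) for b in blocks})
    threshold = min_fraction * len(pages)
    return {key for key, n in counts.items() if n >= threshold}


def strip_boilerplate(blocks, boilerplate):
    """Keep only the blocks whose normalized form is not boilerplate."""
    return [b for b in blocks if normalize(b) not in boilerplate]
```

As noted in the reply above, this needs a set of pages from the same site collected in advance before `find_boilerplate` can build its profile; only then can `strip_boilerplate` be applied to each page before indexing.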