touché

-----Original Message-----
From: Jérôme Charron [mailto:[EMAIL PROTECTED]
Sent: Friday, March 10, 2006 4:34 PM
To: [email protected]; [EMAIL PROTECTED]
Subject: Re: quality of search text

> I think algorithm #1 is what Google uses.
> Google ignores content that does not change from page to page, as well
> as content that isn't part of a block of text.

Are you sure?
Take a look at these search results:
http://www.google.com/search?hl=en&hs=otT&lr=&c2coff=1&safe=off&client=firefox-a&rls=org.mozilla:en-US:official&pwst=1&q=+site:gamingalmanac.com+global+gaming+almanac
... and you will notice that menus are indexed by Google and displayed in summaries.

But if you can contribute an HtmlParseFilter with the ability to remove menus and navigation, it will be a real improvement.
A first step, which I developed in a previous project many years ago, is to remove pages that contain textual content only in links: this avoids indexing frames or iframes that contain nothing but navigation text...

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
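
For illustration, the "first step" described in the message (dropping pages whose only textual content sits inside links) could be sketched roughly as follows. This is a minimal standalone Python sketch using the standard library's html.parser, not the Nutch HtmlParseFilter API and not Jérôme's original code. The names LinkTextCounter, is_navigation_only, and max_other_chars are hypothetical, and the choice to ignore script, style, and title text is an assumption.

```python
from html.parser import HTMLParser


class LinkTextCounter(HTMLParser):
    """Counts non-whitespace text characters inside and outside <a> elements."""

    # Assumption: text in these elements is not visible page content.
    SKIPPED = {"script", "style", "title"}

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0   # > 0 while inside an <a> element
        self.skip_depth = 0     # > 0 while inside a skipped element
        self.link_chars = 0     # text characters found inside links
        self.other_chars = 0    # text characters found anywhere else

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1
        elif tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return
        # Count characters with all whitespace removed.
        n = len("".join(data.split()))
        if self.anchor_depth:
            self.link_chars += n
        else:
            self.other_chars += n


def is_navigation_only(html, max_other_chars=0):
    """Return True if the page has link text and (almost) no other text.

    max_other_chars is a hypothetical tolerance for separators such as
    "|" between menu entries.
    """
    counter = LinkTextCounter()
    counter.feed(html)
    counter.close()
    return counter.link_chars > 0 and counter.other_chars <= max_other_chars
```

A crawler could call is_navigation_only(page_html) before indexing and skip the page when it returns True. A page with no text at all is not flagged, since the check only targets link-only frames; whether such empty pages should also be dropped is a separate decision.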
