touché 

-----Original Message-----
From: Jérôme Charron [mailto:[EMAIL PROTECTED] 
Sent: Friday, March 10, 2006 4:34 PM
To: [email protected]; [EMAIL PROTECTED]
Subject: Re: quality of search text


> I think algorithm #1 is what Google uses.
> Google ignores content that does not change from page to page, as well
> as content that isn't part of a block of text.

Are you sure?
Take a look at these search results:
http://www.google.com/search?hl=en&hs=otT&lr=&c2coff=1&safe=off&client=firefox-a&rls=org.mozilla:en-US:official&pwst=1&q=+site:gamingalmanac.com+global+gaming+almanac
... and you will notice that menus are indexed by Google and displayed
in summaries.

But if you can contribute an HtmlParseFilter with the ability to remove
menus and navigation, it will be a real improvement. A first step, which
I developed in a previous project many years ago, is to remove pages that
contain textual content only in links: this avoids indexing frames or
iframes that only contain some navigation text...
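
To make the "text only in links" heuristic concrete, here is a minimal
sketch in Python. It is not the code from that earlier project and it does
not use the Nutch HtmlParseFilter API; the names LinkTextCounter and
is_navigation_only are made up for illustration. It assumes a page counts
as navigation-only when it has some visible text and all of that text sits
inside <a> elements.

```python
from html.parser import HTMLParser


class LinkTextCounter(HTMLParser):
    """Counts non-whitespace text characters inside and outside <a> elements."""

    # Elements whose text is not part of the visible page body (assumption).
    SKIPPED = {"script", "style", "title"}

    def __init__(self):
        super().__init__()
        self.link_depth = 0   # > 0 while inside an <a> element
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.link_chars = 0   # text characters found inside links
        self.other_chars = 0  # text characters found anywhere else

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1
        elif tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return
        # Ignore whitespace so that layout between tags does not count as text.
        count = len("".join(data.split()))
        if self.link_depth:
            self.link_chars += count
        else:
            self.other_chars += count


def is_navigation_only(html):
    """Return True if the page has text and all of it is inside links."""
    counter = LinkTextCounter()
    counter.feed(html)
    counter.close()
    return counter.link_chars > 0 and counter.other_chars == 0
```

A crawler could call is_navigation_only(page_source) and skip indexing when
it returns True. The check is deliberately strict: a separator such as "|"
between links counts as non-link text, so a real filter would probably want
a ratio threshold rather than an exact zero.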

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
