In web search, link information helps greatly. (This was Google's big discovery.) There are lots more links that point to http://www.slashdot.org/ than to http://www.slashdot.org/xxx/yyy, and many (if not most) of these links have the term "slashdot", while links to http://www.slashdot.org/xxx/yyy are somewhat less likely to contain the term "slashdot".

As Erik hinted, Nutch uses this information. It keeps has a database of links that point to each page, indexes their anchor text along with the page, and boosts highly linked pages more than lesser linked pages.

Doug

Chris Fraschetti wrote:
My lucene implementation works great, its basically an index of many
web crawls. The main thing my users complain about is say a search for
"slashdot" will return the
http://www.slashdot.org/soem_dir/somepage.asp as the top result
because the factors i have scoring it determine it as so... but
obviously in true search engine fashion.. i would like
http://www.slashdot.org/ to be the very top result... i've added a
boost to queries that match the hostname field, which helped a little,
but obviously not a proper solution. Does anyone out there in the
search engine world have a good schema for determining root websites
and applying a huge boost to them in one fashion or another? mainly so
it appears before any sub pages? (assuming the query is in reference
to that site) ...


--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]



Reply via email to