In web search, link information helps greatly. (This was Google's big discovery.) There are many more links that point to a site's root page than to any of its sub-pages, and many (if not most) of these links contain the term "slashdot" in their anchor text, while links to sub-pages are somewhat less likely to contain the term "slashdot".

As Erik hinted, Nutch uses this information. It keeps a database of the links that point to each page, indexes their anchor text along with the page, and boosts highly linked pages more than less-linked pages.
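
To make the idea concrete, here is a rough sketch in Python of the three steps described above: keep a table of inbound links per page, collect their anchor text so it can be indexed with the page, and derive a boost from the inlink count. This is not Nutch's actual code or formula. The function names, the example URLs, and the logarithmic boost are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def build_inlink_db(links):
    """Group links by target page.

    `links` is an iterable of (source_url, target_url, anchor_text) tuples.
    Returns a dict mapping each target URL to its list of (source, anchor).
    """
    inlinks = defaultdict(list)
    for source, target, anchor in links:
        if source != target:  # a page linking to itself tells us nothing
            inlinks[target].append((source, anchor))
    return inlinks


def anchor_terms(inlinks, url):
    """Lowercased anchor-text terms of all links pointing at `url`.

    These would be indexed in a field of their own alongside the page text.
    """
    terms = []
    for _source, anchor in inlinks.get(url, []):
        terms.extend(anchor.lower().split())
    return terms


def link_boost(inlinks, url):
    """Hypothetical boost: grows with the inlink count, but only slowly."""
    return 1.0 + math.log(1 + len(inlinks.get(url, [])))


# Made-up crawl data: three links to a root page, one to a sub-page.
links = [
    ("http://a.example/", "http://news.example/", "news example"),
    ("http://b.example/", "http://news.example/", "News"),
    ("http://c.example/", "http://news.example/", "the news site"),
    ("http://a.example/", "http://news.example/story/1", "interesting story"),
]
db = build_inlink_db(links)
for url in ("http://news.example/", "http://news.example/story/1"):
    print(url, anchor_terms(db, url).count("news"), link_boost(db, url))
```

With this toy data the root page collects more anchor-text matches for "news" and a larger boost than the sub-page, which is the effect wanted for a query like "slashdot". The logarithm is one way to keep very heavily linked pages from swamping every other scoring factor.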


Chris Fraschetti wrote:
My Lucene implementation works great; it's basically an index of many
web crawls. The main thing my users complain about is that, say, a search for
"slashdot" will return a sub-page as the top result
because the factors I have scoring it determine it as so... but
obviously, in true search engine fashion, I would like the root site to be the very top result. I've added a
boost to queries that match the hostname field, which helped a little,
but that's obviously not a proper solution. Does anyone out there in the
search engine world have a good schema for determining root websites
and applying a huge boost to them in one fashion or another, mainly so
that they appear before any sub-pages (assuming the query is in reference
to that site)?
