Well, link spamming is a big problem -- though actual spamming is not as bad as
having the crawler get stuck in a Blog -> Link blog -> AdWords -> Blog ->
more Ads -> Link blog... you get the picture. It's a combination of links on
a page and a little circular web rolled into one. A good way to get around
link spamming and come up with a better score is to look at usage (people
don't spend time on, or revisit, pages that are useless) and combine that with
the link boosts.

I agree with Doug that the link boosts work "surprisingly" well. 60% of the
time they are more than sufficient by themselves. This approach does have
problems, though, namely:

1. A page with many internal links (same domain) tends to get a higher
ranking (unless internal links are ignored).
2. A page can easily be given an artificially higher score by having 2 or
more sites cross-link. (Blogs tend to get a higher score for this reason.)

For Filangy, we needed to come up with a decent scoring system, akin to
PageRank but one that would not take as long to compute, and that would meet
the following goals:

A. Weight recent pages more heavily
B. Be fast to compute and update 
C. Not cause the segments to be reindexed

We looked at the fast PageRank algorithms as well, but PageRank is not exactly
what we need. So what we did was change the way Lucene does its scoring.
We multiply the link boosts with another score which is calculated
externally, updated every 30 minutes, and placed within the segment (this
file is read when the boost file is read by Lucene). There is also another
set of files that keeps track of the score for each user, but that's a very
special case for us.

Where am I going with this? I think this is where we can contribute to
Nutch. Given that we compute the scores for each URL, we can figure out some
way to normalize them and package them as a DB that maps the MD5 of each URL
to a score. This DB could be updated every other week or so. That way, when
Nutch is indexing documents, it can compute its regular link-based score and
combine it with this usage-based one. This means NO changes need to be made
to the Nutch search code (just the part where indexing happens).
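The proposed DB could look something like the sketch below. The class and method names are illustrative only, not actual Nutch APIs; an in-memory map stands in for whatever on-disk DB format we'd actually ship.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

// Sketch of the proposed contribution: a DB keyed by the MD5 of the
// URL, mapping to a normalized usage score. At indexing time the
// indexer looks up the MD5 and folds the usage score into the
// link-based score; search-side code is untouched.
public class UsageScoreDb {
    private final Map<String, Float> byMd5 = new HashMap<>();

    static String md5(String url) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5")
                .digest(url.getBytes(StandardCharsets.UTF_8));
        return String.format("%032x", new BigInteger(1, d)); // hex digest
    }

    void put(String url, float score) throws Exception {
        byMd5.put(md5(url), score);
    }

    // Combine at index time; URLs absent from the DB keep the
    // pure link-based score (neutral factor of 1.0).
    float combined(String url, float linkScore) throws Exception {
        return linkScore * byMd5.getOrDefault(md5(url), 1.0f);
    }

    public static void main(String[] args) throws Exception {
        UsageScoreDb db = new UsageScoreDb();
        db.put("http://example.com/", 2.0f);
        System.out.println(db.combined("http://example.com/", 1.5f)); // 3.0
        System.out.println(db.combined("http://unseen.com/", 1.5f));  // 1.5
    }
}
```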

Would this be of interest to anyone?

Regards,
CC

--------------------------------------------
Filangy, Inc.
Interested in Improving Search? Join our Team!
http://filangy.com/jointheteam.jsp 
 

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, April 19, 2005 2:03 PM
To: [email protected]
Subject: Re: link analysis

Doug Cutting wrote:
> Are many folks successfully using the link analysis implementation?  I

I stopped using it a long time ago, due to performance problems even with
moderately-sized databases.

> had problems with it the last time I tried, and others have reported 
> problems.  Since no one is currently maintaining this code, I propose 
> that we:
> 
>  1. Remove mention of it from the tutorial; and
> 
>  2. Change the defaults for fetchlist.score.by.link.count and 
> indexer.boost.by.link.count to true.
> 
> Objections?

No objections. However, we need to think carefully about what the
recommended procedure then is for maintaining scoring quality. Originally,
the analysis step was intended to do this. Now, the parameters in 2. above
are supposed to affect scoring in the right way. It would be useful to write
a short explanation of why this is so - but even more important, IMHO, would
be to study the shortcomings of this approach (link spamming), and how to
combat them.

This is probably the most difficult, but also the most important, step
toward ensuring high-quality scoring... It would be great to compile a list
of suggestions for this - initially this could take the form of reports
about problems with scoring, endless looping, or similar. Then we could
think of solutions, and how/where they should be implemented.

-- 
Best regards,
Andrzej Bialecki
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
