Well, link spamming is a big problem -- though spamming itself is not as bad as having the crawler get stuck in a blog -> link blog -> AdWords -> blog -> more ads -> link blog... you get the picture. It's a combination of links on a page and a little circular web rolled into one. A good way to get around link spamming and produce a better score is to look at usage (people don't spend time on, or revisit, pages that are useless) and combine that with the link boosts.
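The idea of combining usage with link boosts could be sketched roughly as follows. This is a minimal illustration, not Nutch code; the class name, method, and the usageWeight knob are all assumptions for the sake of the example:

```java
// A minimal sketch (not Nutch code) of combining a link-based boost with
// a usage-based score, so a heavily cross-linked page that nobody ever
// visits scores lower than its links alone would suggest.
public class CombinedScore {
    // usageWeight is an assumed tuning knob in [0, 1]; at 0 the link
    // boost is used unchanged, at 1 usage dominates entirely.
    public static float combine(float linkBoost, float usageScore, float usageWeight) {
        return linkBoost * ((1.0f - usageWeight) + usageWeight * usageScore);
    }

    public static void main(String[] args) {
        // A spam ring with a high link boost but zero recorded usage
        // is halved at usageWeight = 0.5.
        System.out.println(combine(2.0f, 0.0f, 0.5f)); // 1.0
        System.out.println(combine(2.0f, 1.0f, 0.5f)); // 2.0
    }
}
```

The interpolation keeps a baseline so pages with no recorded usage are demoted rather than zeroed out entirely.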
I agree with Doug that the link boosts work "surprisingly" well; 60% of the time they are more than sufficient by themselves. This approach does have problems, though, namely:

1. A page with many internal links (same domain) tends to get a higher ranking (unless internal links are ignored).
2. A page can easily be given an artificially higher score by having two or more sites cross-link. (Blogs tend to get a higher score for this reason.)

For Filangy, we needed to come up with a decent scoring system, akin to PageRank, but one that would not take as long to compute and that would meet the following goals:

A. Weight recent pages more heavily.
B. Be fast to compute and update.
C. Not cause the segments to be reindexed.

We looked at the fast PageRank algorithms as well, but PageRank is not exactly what we need. So what we did is change the way Lucene does its scoring: we multiply the link boosts by another score which is calculated externally, updated every 30 minutes, and placed within the segment (this file is read when the boost file is read by Lucene). There is also another set of files that keeps track of the score for each user, but that's a very special case for us.

Where am I going with this? I think this is where we can contribute to Nutch. Given that we compute the scores for each URL, we can figure out some way to normalize this and package it as a DB that maps the URL's MD5 to a score. This DB could be updated every other week or so. That way, when Nutch is indexing documents, it can compute its regular link-based score and combine it with this usage-based one. This would mean NO changes to the Nutch search code (just the part where indexing happens).

Would this be of interest to anyone?

Regards,
CC

--------------------------------------------
Filangy, Inc.
Interested in Improving Search? Join our Team!
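The proposed DB of URL MD5 -> score, consulted at indexing time, might look something like the sketch below. Everything here is illustrative -- the class, method names, and the neutral fallback of 1.0 are assumptions, not actual Nutch or Filangy APIs:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed score DB: URL MD5 -> externally
// computed usage score. Names are illustrative, not actual Nutch APIs.
public class UsageScoreDb {
    private final Map<String, Float> scores = new HashMap<>();

    // Hex-encoded MD5 of the URL, used as the DB key.
    static String md5(String url) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : d) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    public void put(String url, float usageScore) {
        scores.put(md5(url), usageScore);
    }

    // At indexing time, multiply the regular link-based boost by the
    // usage score; URLs absent from the DB fall back to a neutral 1.0,
    // so their link boost is used unchanged.
    public float boostFor(String url, float linkBoost) {
        return linkBoost * scores.getOrDefault(md5(url), 1.0f);
    }

    public static void main(String[] args) {
        UsageScoreDb db = new UsageScoreDb();
        db.put("http://example.com/", 2.0f);
        System.out.println(db.boostFor("http://example.com/", 1.5f));     // 3.0
        System.out.println(db.boostFor("http://unknown.example/", 1.5f)); // 1.5
    }
}
```

Keying on the MD5 keeps the DB entries fixed-width, and the neutral fallback means pages outside the usage DB are scored exactly as they are today, which is what makes the change indexing-only.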
http://filangy.com/jointheteam.jsp

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, April 19, 2005 2:03 PM
To: [email protected]
Subject: Re: link analysis

Doug Cutting wrote:
> Are many folks successfully using the link analysis implementation? I

I stopped using it a long time ago, due to performance problems even with moderately-sized databases.

> had problems with it the last time I tried, and others have reported
> problems. Since no one is currently maintaining this code, I propose
> that we:
>
> 1. Remove mention of it from the tutorial; and
>
> 2. Change the defaults for fetchlist.score.by.link.count and
> indexer.boost.by.link.count to true.
>
> Objections?

No objections. However, we need to think carefully about what is then the recommended procedure for maintaining scoring quality. Originally, the analysis step was intended to do this. Now, the parameters in 2. above are supposed to affect scoring in the right way. It would be useful to write a short explanation of why this is so, but even more important, IMHO, would be to study the shortcomings of this approach (link spamming) and how to combat them. This is probably the most difficult, but also the most important, step to ensure a high quality of scoring... It would be great to compile a list of suggestions for this; initially this could take the form of reports about problems with scoring, endless looping, or similar. Then we could think of solutions, and how/where they should be implemented.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
