Hello,

I do not know which method of computing the score is really better, but I would like to clarify one issue: as far as I can tell from the code itself, all methods (dbanalyze, fetchlist.score.by.link.count, indexer.boost.by.link.count) use inlinks.

fetchlist.score.by.link.count is used here:

  curScore.set(scoreByLinkCount
      ? (float)Math.log(anchors.length+1) : page.getScore());

where anchors is the array of anchors pointing to the given URL.


indexer.boost.by.link.count is used in IndexSegment.java:

    // 1. Start with page's score from DB -- 1.0 if no link analysis.
    float boost = fo.getFetchListEntry().getPage().getScore();
    // 2. Apply scorePower to this.
    boost = (float)Math.pow(boost, scorePower);
    // 3. Optionally boost by log of incoming anchor count.
    if (boostByLinkCount)
      boost *= (float)Math.log(Math.E + fo.getAnchors().length);

So if both the fetchlist.score.by.link.count and indexer.boost.by.link.count properties are set, the number of inlinks is in fact used twice in the score computation.
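To make the double counting concrete, here is a minimal standalone sketch that chains the two formulas quoted above. It assumes the score set at fetchlist time is the same value the indexer later reads via getPage().getScore(). The class name, method names, and the scorePower value of 0.5 are hypothetical and chosen only for illustration; this is not the Nutch code itself.

```java
public class LinkCountScoreSketch {

    // Mirrors the fetchlist snippet: log(inlinks + 1) when the flag is set,
    // otherwise the page's existing score.
    static float fetchlistScore(boolean scoreByLinkCount, int anchorCount, float pageScore) {
        return scoreByLinkCount ? (float) Math.log(anchorCount + 1) : pageScore;
    }

    // Mirrors the IndexSegment snippet: score^scorePower, optionally
    // multiplied by log(e + inlinks).
    static float indexerBoost(float score, float scorePower,
                              boolean boostByLinkCount, int anchorCount) {
        float boost = (float) Math.pow(score, scorePower);
        if (boostByLinkCount) {
            boost *= (float) Math.log(Math.E + anchorCount);
        }
        return boost;
    }

    public static void main(String[] args) {
        float scorePower = 0.5f; // hypothetical value, for illustration only
        int anchors = 10;        // hypothetical inlink count

        // Only the indexer flag set: inlinks enter once.
        float once = indexerBoost(fetchlistScore(false, anchors, 1.0f),
                                  scorePower, true, anchors);
        // Both flags set: inlinks enter through the score and through the boost.
        float twice = indexerBoost(fetchlistScore(true, anchors, 1.0f),
                                   scorePower, true, anchors);

        System.out.println("indexer flag only: " + once);
        System.out.println("both flags:        " + twice);
    }
}
```

One side effect visible in this sketch: with both flags set, a page with zero inlinks gets a score of log(1) = 0, so its boost is 0 as well, whatever the second factor is.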


In my opinion, the main difference between simply counting inlinks (as the indexer.boost.by.link.count and fetchlist.score.by.link.count methods do) and DB analysis (PageRank computation) is that the latter takes the quality of inlinks into account. For fetchlist.score.by.link.count and indexer.boost.by.link.count all inlinks are treated equally, but PageRank takes into account the score of the page each inlink originates from. So I suppose it should provide better results, but because of link spam etc. I would not dare to claim so. I am doing some tests on my collections right now, but it is difficult to judge whether the results are really better with PageRank.
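A small toy example may illustrate the "quality of inlinks" point. The sketch below uses a simplified textbook PageRank (power iteration with a damping factor), not the actual DB analysis implementation. The five-page graph, the class name, and the parameters are all hypothetical. Pages 1 and 2 each have exactly one inlink, so plain link counting cannot tell them apart, but page 1 is linked from the heavily linked page 0 while page 2 is linked from a page nobody links to.

```java
import java.util.Arrays;

public class InlinkQualitySketch {

    // Hypothetical toy graph: OUTLINKS[i] lists the pages that page i links to.
    static final int[][] OUTLINKS = {
        {1},     // page 0 -> 1
        {0},     // page 1 -> 0
        {0},     // page 2 -> 0
        {0, 2},  // page 3 -> 0 and 2
        {0}      // page 4 -> 0
    };

    // Plain inlink counting: every inlink is worth the same.
    static int[] inlinkCounts(int[][] outlinks) {
        int[] counts = new int[outlinks.length];
        for (int[] targets : outlinks) {
            for (int target : targets) {
                counts[target]++;
            }
        }
        return counts;
    }

    // Simplified PageRank by power iteration.
    // Assumes every page has at least one outlink (true for the graph above).
    static double[] pageRank(int[][] outlinks, double damping, int iterations) {
        int n = outlinks.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n);
            for (int src = 0; src < n; src++) {
                // Each page passes its rank on in equal shares to its targets.
                double share = damping * rank[src] / outlinks[src].length;
                for (int target : outlinks[src]) {
                    next[target] += share;
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        int[] counts = inlinkCounts(OUTLINKS);
        double[] rank = pageRank(OUTLINKS, 0.85, 100);
        for (int page = 0; page < OUTLINKS.length; page++) {
            System.out.println("page " + page + ": inlinks=" + counts[page]
                + " rank=" + rank[page]);
        }
    }
}
```

In this toy graph pages 1 and 2 get the same link-count score but clearly different ranks, which is the difference described above. It says nothing about which ordering gives better search results on a real collection.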

Regards
Piotr




Andrzej Bialecki wrote:
Byron Miller wrote:

Here is what the great Doug said:

"
Are you using link analysis? Perhaps it is doing you a disservice by
prioritizing one site above the others. Try, in place of the analyze
command, setting both fetchlist.score.by.link.count and
indexer.boost.by.link.count to true. Please tell us how that works for you.

Doug"

I did this and haven't run analyze since then and you can see the results
on mozdex.com looking pretty good!


Both methods boost well-connected pages and penalize poorly connected ones. However, if I understand this correctly, the implications of using this method instead of DB analysis are the following:

* DB analysis builds a web graph to discover how many incoming links point to a given page, and calculates the score based on that (which is essentially what Google's PageRank is about)

* scoring by outlink count also promotes well-connected pages, but this time the ones with a lot of _outgoing_ links.

PageRank is based on an assumption about social behaviour: that people will link to pages they find interesting and relevant, so a well-linked page must therefore be important. Such a page will get a higher score (will be considered more relevant to the query, all other factors being equal).

The method that scores by outlink count seems to promote pages that are just link directories. However, in reality such pages don't have to be more relevant to the query than pages with few outlinks, because they may point to many very disparate areas. But they will still get a higher score just by virtue of having a lot of outgoing links. So, in this case the relationship between the social behaviour of linking to interesting pages and page relevance doesn't apply, because the links don't reflect someone's judgment that this page is interesting.

Having said that, I'm a practical person, too - if it works well enough, then all the better for us. :-) And PageRank is not an oracle either.

