Hi,
I'm working on a LinkRank implementation with Giraph for both Nutch 1.x
and 2.x. What I'm planning [1] is to extract the outlink data, feed it
as a graph to Giraph, and run the LinkRank calculation there. Then I'd
read the results and inject them back into Nutch.
Summary of the task:
1. Get the outlink DB and write it as [URL, URL] pairs on HDFS.
2. Write the current (initial) scores from the CrawlDB as [URL, Score] pairs on HDFS.
3. Run Giraph LinkRank.
4. Read the resulting [URL, NewScore] pairs.
5. Update the CrawlDB with the new scores.
So the plugin will act like a proxy between Nutch and Giraph. A rough
sketch of the Giraph side (step 3) is below.
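For step 3 I'm assuming LinkRank can be computed along the same lines as
the PageRank example that ships with Giraph. This is an untested sketch:
the class name, damping factor and superstep limit are placeholders, and
the exact API depends on the Giraph version.

import java.io.IOException;

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

/** Hypothetical PageRank-style LinkRank over URL vertices. */
public class LinkRankComputation extends
    BasicComputation<Text, DoubleWritable, NullWritable, DoubleWritable> {

  private static final int MAX_SUPERSTEPS = 10;  // placeholder
  private static final double DAMPING = 0.85;    // placeholder

  @Override
  public void compute(Vertex<Text, DoubleWritable, NullWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
    if (getSuperstep() >= 1) {
      // sum the score contributions received from inlinking pages
      double sum = 0;
      for (DoubleWritable msg : messages) {
        sum += msg.get();
      }
      vertex.setValue(new DoubleWritable(
          (1 - DAMPING) / getTotalNumVertices() + DAMPING * sum));
    }
    if (getSuperstep() < MAX_SUPERSTEPS) {
      // split this page's score evenly among its outlinks
      if (vertex.getNumEdges() > 0) {
        sendMessageToAllEdges(vertex, new DoubleWritable(
            vertex.getValue().get() / vertex.getNumEdges()));
      }
    } else {
      vertex.voteToHalt();
    }
  }
}

The [URL, NewScore] pairs of step 4 would then just be the final vertex
values written out by a text vertex output format.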
As far as I can see, the ScoringFilter mechanism in 1.x requires
implementing methods that handle URLs one by one, for example:
public CrawlDatum distributeScoreToOutlinks(Text fromUrl,
    ParseData parseData, Collection<Entry<Text, CrawlDatum>> targets,
    CrawlDatum adjust, int allCount) throws ScoringFilterException;
But I'd like to read/write the whole DB at once. So now I'm thinking
that instead of a ScoringFilter I should write a generic plugin to
achieve this. Should I extend Pluggable? Could you suggest the best way
to go about it? (A rough sketch of how I picture the CrawlDB dump side
is below.) I'm starting with 1.x but will move on to 2.x, so suggestions
for both are welcome.
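To make the "whole DB" idea concrete, here is roughly what I have in
mind for step 2 as a plain map-only job rather than a ScoringFilter. It
is an untested sketch, the class name CrawlDbScoreDumper is made up, and
it assumes the new mapreduce API:

import java.io.IOException;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.util.NutchConfiguration;

/** Hypothetical job: dump the CrawlDB as [URL, Score] text pairs on HDFS. */
public class CrawlDbScoreDumper extends Configured implements Tool {

  /** Emits one URL \t score line per CrawlDB entry. */
  public static class ScoreMapper
      extends Mapper<Text, CrawlDatum, Text, FloatWritable> {
    @Override
    protected void map(Text url, CrawlDatum datum, Context context)
        throws IOException, InterruptedException {
      context.write(url, new FloatWritable(datum.getScore()));
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    // args[0] = crawldb dir, args[1] = output dir used as Giraph input
    Job job = Job.getInstance(getConf(), "crawldb-score-dump");
    job.setJarByClass(CrawlDbScoreDumper.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    SequenceFileInputFormat.addInputPath(job,
        new Path(args[0], CrawlDb.CURRENT_NAME));
    job.setMapperClass(ScoreMapper.class);
    job.setNumReduceTasks(0);  // map-only dump
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(FloatWritable.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    TextOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(NutchConfiguration.create(),
        new CrawlDbScoreDumper(), args));
  }
}

Step 1 would look almost the same, just reading the outlink data and
emitting [URL, URL] pairs instead.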
Thanks,
[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=31820383