Hi,

I'm working on a LinkRank implementation with Giraph for both Nutch 1.x and 2.x. The plan [1] is to extract the outlink data, feed it to Giraph as a graph, and run the LinkRank calculation there, then read the results and inject them back into Nutch.

Summary of the task:
1. Dump the outlink DB as [URL, URL] pairs on HDFS.
2. Write the current (initial) scores from the CrawlDB as [URL, Score] pairs on HDFS.
3. Run Giraph LinkRank (a sketch follows below).
4. Read the resulting [URL, NewScore] pairs.
5. Update the CrawlDB with the new scores.

So the plugin will be like a proxy.
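For step 3, here is a minimal sketch of what I have in mind on the Giraph side, assuming the BasicComputation API of a recent Giraph release; the class name LinkRankComputation and the DAMPING / MAX_SUPERSTEPS constants are placeholders I made up:

import java.io.IOException;

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

public class LinkRankComputation extends
    BasicComputation<Text, DoubleWritable, NullWritable, DoubleWritable> {

  private static final double DAMPING = 0.85;   // placeholder damping factor
  private static final int MAX_SUPERSTEPS = 30; // placeholder iteration limit

  @Override
  public void compute(Vertex<Text, DoubleWritable, NullWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
    if (getSuperstep() > 0) {
      // Sum the score contributions received from in-links.
      double sum = 0;
      for (DoubleWritable msg : messages) {
        sum += msg.get();
      }
      double newScore = (1 - DAMPING) / getTotalNumVertices() + DAMPING * sum;
      vertex.setValue(new DoubleWritable(newScore));
    }
    if (getSuperstep() < MAX_SUPERSTEPS) {
      // Spread the current score evenly over the out-links.
      long edges = vertex.getNumEdges();
      if (edges > 0) {
        sendMessageToAllEdges(vertex,
            new DoubleWritable(vertex.getValue().get() / edges));
      }
    } else {
      vertex.voteToHalt();
    }
  }
}

The [URL, URL] pairs from step 1 would be loaded as edges through one of Giraph's text input formats, and the [URL, Score] pairs from step 2 would supply the initial vertex values.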

As far as I can see, the ScoringFilter mechanism in 1.x requires implementing methods that process URLs one by one, for example:
public CrawlDatum distributeScoreToOutlinks(Text fromUrl, ParseData parseData,
    Collection<Entry<Text, CrawlDatum>> targets, CrawlDatum adjust,
    int allCount) throws ScoringFilterException;


But I'd like to read/write the whole DB at once. So I now think that, instead of a ScoringFilter, I should write a generic plugin to achieve this. Should I extend Pluggable? Could you suggest what the best way to do this would be? I'm starting with 1.x but will move on to 2.x, so suggestions for both are welcome.
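To illustrate what I mean by working on the whole DB rather than URL by URL, here is a rough sketch of step 2 as a standalone map-only job that dumps [URL, Score] pairs from the CrawlDb. The class name ScoreDumper and the argument layout are placeholders I made up, and I'm assuming the part files under <crawldb>/current can be read with SequenceFileInputFormat:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.nutch.crawl.CrawlDatum;

/** Placeholder name: a map-only job dumping [URL, Score] pairs from the CrawlDb. */
public class ScoreDumper {

  public static class ScoreMapper
      extends Mapper<Text, CrawlDatum, Text, FloatWritable> {
    @Override
    protected void map(Text url, CrawlDatum datum, Context context)
        throws IOException, InterruptedException {
      // Emit the current score for each URL in the CrawlDb.
      context.write(url, new FloatWritable(datum.getScore()));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "crawldb score dump");
    job.setJarByClass(ScoreDumper.class);
    job.setMapperClass(ScoreMapper.class);
    job.setNumReduceTasks(0);                 // map-only
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(FloatWritable.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // <crawldb>/current
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir on HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Updating the CrawlDb with the new scores (step 5) would then be a similar job that merges the Giraph output back into the existing CrawlDatum entries.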

Thanks,


[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=31820383
