Hi,
I'm working on a LinkRank implementation with Giraph for both Nutch 1.x
and 2.x. What I'm planning [1] is to extract the outlink data, feed it
as a graph to Giraph, and run the LinkRank calculation there. Then I'd
read the results and inject them back into Nutch.
Summary of the task:
1. Get the outlink DB and write it as [URL, URL] pairs on HDFS.
2. Write the current (initial) scores from the CrawlDB as [URL, Score] pairs on HDFS.
3. Run Giraph LinkRank.
4. Read the resulting [URL, NewScore] pairs.
5. Update the CrawlDB with the new scores.
So the plugin will act like a proxy between Nutch and Giraph. A rough
sketch of the Giraph side (step 3) is below.
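For step 3 I'm assuming LinkRank can be computed along the same lines as
the PageRank example that ships with Giraph. This is an untested sketch:
the class name, damping factor and superstep limit are placeholders, and
the exact API depends on the Giraph version.

import java.io.IOException;

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

/** Hypothetical PageRank-style LinkRank over URL vertices. */
public class LinkRankComputation extends
    BasicComputation<Text, DoubleWritable, NullWritable, DoubleWritable> {

  private static final int MAX_SUPERSTEPS = 10;  // placeholder
  private static final double DAMPING = 0.85;    // placeholder

  @Override
  public void compute(Vertex<Text, DoubleWritable, NullWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
    if (getSuperstep() >= 1) {
      // sum the score contributions received from inlinking pages
      double sum = 0;
      for (DoubleWritable msg : messages) {
        sum += msg.get();
      }
      vertex.setValue(new DoubleWritable(
          (1 - DAMPING) / getTotalNumVertices() + DAMPING * sum));
    }
    if (getSuperstep() < MAX_SUPERSTEPS) {
      // split this page's score evenly among its outlinks
      if (vertex.getNumEdges() > 0) {
        sendMessageToAllEdges(vertex, new DoubleWritable(
            vertex.getValue().get() / vertex.getNumEdges()));
      }
    } else {
      vertex.voteToHalt();
    }
  }
}

The [URL, NewScore] pairs of step 4 would then just be the final vertex
values written out by a text vertex output format.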
As far as I can see, the ScoringFilter mechanism in 1.x requires
implementing methods that handle URLs one by one, for example:
public CrawlDatum distributeScoreToOutlinks(Text fromUrl,
    ParseData parseData, Collection<Entry<Text, CrawlDatum>> targets,
    CrawlDatum adjust, int allCount) throws ScoringFilterException;
But I'd like to read/write the whole DB at once. So now I'm thinking
that instead of a ScoringFilter I should write a generic plugin to
achieve this. Should I extend Pluggable? Could you suggest the best way
to go about it? (A rough sketch of how I picture the CrawlDB dump side
is below.) I'm starting with 1.x but will move on to 2.x, so suggestions
for both are welcome.
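To make the "whole DB" idea concrete, here is roughly what I have in
mind for step 2 as a plain map-only job rather than a ScoringFilter. It
is an untested sketch, the class name CrawlDbScoreDumper is made up, and
it assumes the new mapreduce API:

import java.io.IOException;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.util.NutchConfiguration;

/** Hypothetical job: dump the CrawlDB as [URL, Score] text pairs on HDFS. */
public class CrawlDbScoreDumper extends Configured implements Tool {

  /** Emits one URL \t score line per CrawlDB entry. */
  public static class ScoreMapper
      extends Mapper<Text, CrawlDatum, Text, FloatWritable> {
    @Override
    protected void map(Text url, CrawlDatum datum, Context context)
        throws IOException, InterruptedException {
      context.write(url, new FloatWritable(datum.getScore()));
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    // args[0] = crawldb dir, args[1] = output dir used as Giraph input
    Job job = Job.getInstance(getConf(), "crawldb-score-dump");
    job.setJarByClass(CrawlDbScoreDumper.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    SequenceFileInputFormat.addInputPath(job,
        new Path(args[0], CrawlDb.CURRENT_NAME));
    job.setMapperClass(ScoreMapper.class);
    job.setNumReduceTasks(0);  // map-only dump
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(FloatWritable.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    TextOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(NutchConfiguration.create(),
        new CrawlDbScoreDumper(), args));
  }
}

Step 1 would look almost the same, just reading the outlink data and
emitting [URL, URL] pairs instead.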
Thanks,
[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=31820383