Hi Ahmet! This is really interesting, and I'm very curious about the performance improvements. For example, it can take many hours to calculate billions of records for 10 power iterations in 1.x! Please open a new issue in Jira whenever you're ready, and consider copying your wiki page to the Nutch MoinMoin wiki; it is certainly going to be very useful!
Thanks,
Markus

-----Original message-----
> From: Ahmet Emre Aladağ <[email protected]>
> Sent: Wed 29-May-2013 10:00
> To: [email protected]
> Subject: Generic LinkRank plugin for Nutch
>
> Hi,
>
> I'm working on a LinkRank implementation with Giraph for both Nutch 1.x
> and 2.x. What I'm planning [1] is to extract the outlink data, feed it
> as a graph to Giraph, and perform the LinkRank calculation there. Then
> I read the results and inject them back into Nutch.
>
> Summary of the task:
> 1. Get the outlinkDB and write it as [URL, URL] pairs on HDFS.
> 2. Write the current (initial) scores from the CrawlDB as [URL, Score]
>    pairs on HDFS.
> 3. Run Giraph LinkRank.
> 4. Read the resulting [URL, NewScore] pairs.
> 5. Update the CrawlDB with the new scores.
>
> So the plugin will act like a proxy.
>
> As far as I can see, the ScoringFilter mechanism in 1.x requires
> implementing methods that process URLs one by one, e.g.:
>
> public CrawlDatum distributeScoreToOutlinks(Text fromUrl,
>     ParseData parseData, Collection<Entry<Text, CrawlDatum>> targets,
>     CrawlDatum adjust, int allCount) throws ScoringFilterException;
>
> But I'd like to read/write the whole DB at once. So I now think that,
> instead of a ScoringFilter, I should write a generic plugin to achieve
> this. Should I extend Pluggable? Could you give suggestions on the best
> way to achieve this? I'm starting with 1.x but will move on to 2.x, so
> suggestions for both are welcome.
>
> Thanks,
>
> [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=31820383
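
For steps 1 and 2 of the plan, a minimal sketch of what dumping the CrawlDB scores as [URL, Score] pairs could look like in Nutch 1.x, where the CrawlDB is a SequenceFile of <Text, CrawlDatum> records. The class name and job wiring here are illustrative, not an existing Nutch API:

import java.io.IOException;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.nutch.crawl.CrawlDatum;

// Hypothetical mapper: reads <url, CrawlDatum> records from the CrawlDB
// and emits <url, score> pairs that Giraph can later consume as initial
// vertex values.
public class ScoreDumpMapper
    extends Mapper<Text, CrawlDatum, Text, FloatWritable> {

  @Override
  protected void map(Text url, CrawlDatum datum, Context context)
      throws IOException, InterruptedException {
    // CrawlDatum.getScore() holds the page's current score.
    context.write(url, new FloatWritable(datum.getScore()));
  }
}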
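
And for step 3, a rough sketch of the Giraph side, assuming the BasicComputation API from more recent Giraph releases and a fixed number of power iterations; the class name, damping factor, and LinkRank update rule shown here (a plain PageRank-style update) are placeholders, not Ahmet's actual implementation:

import java.io.IOException;

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

// Illustrative LinkRank-style computation: each vertex is a URL, its
// value is the score, and its edges are outlinks. Runs a fixed number
// of power iterations.
public class LinkRankComputation extends
    BasicComputation<Text, DoubleWritable, NullWritable, DoubleWritable> {

  private static final int MAX_SUPERSTEPS = 10;  // "10 power iterations"
  private static final double DAMPING = 0.85;    // placeholder value

  @Override
  public void compute(Vertex<Text, DoubleWritable, NullWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
    if (getSuperstep() > 0) {
      // Sum the score contributions received from inlinking pages.
      double sum = 0;
      for (DoubleWritable msg : messages) {
        sum += msg.get();
      }
      vertex.setValue(new DoubleWritable((1.0 - DAMPING) + DAMPING * sum));
    }
    if (getSuperstep() < MAX_SUPERSTEPS) {
      // Distribute the current score evenly over the outlinks.
      int numOutlinks = vertex.getNumEdges();
      if (numOutlinks > 0) {
        sendMessageToAllEdges(vertex,
            new DoubleWritable(vertex.getValue().get() / numOutlinks));
      }
    } else {
      vertex.voteToHalt();
    }
  }
}

The final vertex values could then be written out as [URL, NewScore] pairs via a VertexOutputFormat, which covers step 4 and feeds the CrawlDB update in step 5.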

