Hi Ahmet! This is really interesting, and I'm very curious about the performance improvements. For example, it can take many hours to calculate billions of records for 10 power iterations in 1.x! Please open a new issue in Jira whenever you're ready, and consider copying your wiki page to the Nutch MoinMoin wiki; it is certainly going to be very useful!
Thanks,
Markus

-----Original message-----
> From: Ahmet Emre Aladağ <[email protected]>
> Sent: Wed 29-May-2013 10:00
> To: [email protected]
> Subject: Generic LinkRank plugin for Nutch
>
> Hi,
>
> I'm working on a LinkRank implementation with Giraph for both Nutch 1.x
> and 2.x. What I'm planning [1] is to extract the outlink data, feed it
> as a graph to Giraph, and perform the LinkRank calculation there. Then
> I read the results and inject them back into Nutch.
>
> Summary of the task:
> 1. Get the outlinkDB and write it as [URL, URL] pairs on HDFS.
> 2. Write the current (initial) scores from the CrawlDB as [URL, Score]
>    pairs on HDFS.
> 3. Run Giraph LinkRank.
> 4. Read the resulting [URL, NewScore] pairs.
> 5. Update the CrawlDB with the new scores.
>
> So the plugin will act like a proxy.
>
> As far as I can see, the ScoringFilter mechanism in 1.x requires
> implementing methods that process URLs one by one, e.g.:
>
> public CrawlDatum distributeScoreToOutlinks(Text fromUrl,
>     ParseData parseData, Collection<Entry<Text, CrawlDatum>> targets,
>     CrawlDatum adjust, int allCount) throws ScoringFilterException;
>
> But I'd like to read/write the whole DB at once. So I now think that,
> instead of a ScoringFilter, I should write a generic plugin to achieve
> this. Should I extend Pluggable? Could you give suggestions on the best
> way to achieve this? I'm starting with 1.x but will move on to 2.x, so
> suggestions for both are welcome.
>
> Thanks,
>
> [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=31820383
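
For steps 1 and 2 of the plan, a minimal sketch of what dumping the CrawlDB scores as [URL, Score] pairs could look like in Nutch 1.x, where the CrawlDB is a SequenceFile of <Text, CrawlDatum> records. The class name and job wiring here are illustrative, not an existing Nutch API:

import java.io.IOException;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.nutch.crawl.CrawlDatum;

// Hypothetical mapper: reads <url, CrawlDatum> records from the CrawlDB
// and emits <url, score> pairs that Giraph can later consume as initial
// vertex values.
public class ScoreDumpMapper
    extends Mapper<Text, CrawlDatum, Text, FloatWritable> {

  @Override
  protected void map(Text url, CrawlDatum datum, Context context)
      throws IOException, InterruptedException {
    // CrawlDatum.getScore() holds the page's current score.
    context.write(url, new FloatWritable(datum.getScore()));
  }
}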
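
And for step 3, a rough sketch of the Giraph side, assuming the BasicComputation API from more recent Giraph releases and a fixed number of power iterations; the class name, damping factor, and LinkRank update rule shown here (a plain PageRank-style update) are placeholders, not Ahmet's actual implementation:

import java.io.IOException;

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

// Illustrative LinkRank-style computation: each vertex is a URL, its
// value is the score, and its edges are outlinks. Runs a fixed number
// of power iterations.
public class LinkRankComputation extends
    BasicComputation<Text, DoubleWritable, NullWritable, DoubleWritable> {

  private static final int MAX_SUPERSTEPS = 10;  // "10 power iterations"
  private static final double DAMPING = 0.85;    // placeholder value

  @Override
  public void compute(Vertex<Text, DoubleWritable, NullWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
    if (getSuperstep() > 0) {
      // Sum the score contributions received from inlinking pages.
      double sum = 0;
      for (DoubleWritable msg : messages) {
        sum += msg.get();
      }
      vertex.setValue(new DoubleWritable((1.0 - DAMPING) + DAMPING * sum));
    }
    if (getSuperstep() < MAX_SUPERSTEPS) {
      // Distribute the current score evenly over the outlinks.
      int numOutlinks = vertex.getNumEdges();
      if (numOutlinks > 0) {
        sendMessageToAllEdges(vertex,
            new DoubleWritable(vertex.getValue().get() / numOutlinks));
      }
    } else {
      vertex.voteToHalt();
    }
  }
}

The final vertex values could then be written out as [URL, NewScore] pairs via a VertexOutputFormat, which covers step 4 and feeds the CrawlDB update in step 5.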

