Hi Ahmet,

You don't need to use the ScoringFilters at all. The
nutch.scoring.webgraph package can be taken as an example of how to do it. It
works fine as far as I know, but what we wanted from the Giraph-based
replacement was less code to maintain and something we could use in 2.x
straight away. If there are performance improvements as well, all the
better!
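
In case it helps, here is a minimal sketch of what steps 3 and 4 of your
summary amount to, written as plain in-memory Java rather than Giraph or
Nutch code. Everything in it is an assumption for illustration: the class
name LinkRankSketch, the rank method, the damping factor and the iteration
count do not come from Nutch or Giraph, and the update rule is a generic
PageRank-style one that may differ from what the webgraph LinkRank does.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Nutch or Giraph.
public class LinkRankSketch {

    /**
     * Runs a fixed number of PageRank-style iterations.
     *
     * @param edges      [source URL, target URL] pairs (step 1 of the summary)
     * @param initial    URL -> initial score (step 2 of the summary)
     * @param iterations number of passes over the graph (assumed parameter)
     * @param damping    damping factor, e.g. 0.85 (assumed parameter)
     * @return URL -> new score (what step 4 would read back)
     */
    public static Map<String, Double> rank(List<String[]> edges,
                                           Map<String, Double> initial,
                                           int iterations,
                                           double damping) {
        Map<String, Double> scores = new HashMap<>(initial);
        Map<String, List<String>> outlinks = new HashMap<>();

        // Build the adjacency lists; URLs without an initial score start at 0.
        for (String[] e : edges) {
            outlinks.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
            scores.putIfAbsent(e[0], 0.0);
            scores.putIfAbsent(e[1], 0.0);
        }

        for (int i = 0; i < iterations; i++) {
            // Every URL starts the round with the base score (1 - damping).
            Map<String, Double> next = new HashMap<>();
            for (String url : scores.keySet()) {
                next.put(url, 1.0 - damping);
            }
            // Each URL splits its damped score evenly among its outlinks.
            // URLs with no outlinks pass nothing on in this simple variant.
            for (Map.Entry<String, List<String>> en : outlinks.entrySet()) {
                double share = damping * scores.get(en.getKey()) / en.getValue().size();
                for (String target : en.getValue()) {
                    next.merge(target, share, Double::sum);
                }
            }
            scores = next;
        }
        return scores;
    }

    public static void main(String[] args) {
        List<String[]> edges = new ArrayList<>();
        edges.add(new String[] {"http://a.example/", "http://b.example/"});
        edges.add(new String[] {"http://b.example/", "http://a.example/"});

        Map<String, Double> initial = new HashMap<>();
        initial.put("http://a.example/", 1.0);
        initial.put("http://b.example/", 1.0);

        System.out.println(rank(edges, initial, 10, 0.85));
    }
}
```

In the real thing Giraph would do the iteration, and the two maps would be
replaced by the [URL, URL] and [URL, Score] files on HDFS.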

Thanks

Julien


On 29 May 2013 09:00, Ahmet Emre Aladağ <[email protected]> wrote:

> Hi,
>
> I'm working on a LinkRank implementation with Giraph for both Nutch 1.x and
> 2.x. What I'm planning [1] is to get the outlink data, give it to Giraph as
> a graph, and perform the LinkRank calculation. Then I would read the results
> and inject them back into Nutch.
>
> Summary of the task:
> 1. Get the outlinkDB, write it as [URL, URL] pairs on HDFS,
> 2. Write current (initial) scores from CrawlDB as [URL, Score] on HDFS.
> 3. Run Giraph LinkRank.
> 4. Read the resulting [URL, NewScore] pairs
> 5. Update CrawlDB with the new scores.
>
> So the plugin will be like a proxy.
>
> As far as I can see, the ScoringFilter mechanism in 1.x requires
> implementing methods that handle URLs one by one.
>
> Ex:
>   public CrawlDatum distributeScoreToOutlinks(Text fromUrl,
>           ParseData parseData,
>           Collection<Entry<Text, CrawlDatum>> targets, CrawlDatum adjust,
>           int allCount) throws ScoringFilterException;
>
>
> But I'd like to write/read the whole db. I now think that instead of a
> ScoringFilter, I should write a generic plugin to achieve this. Should I
> extend Pluggable? Could you suggest the best way to achieve this? I'm
> starting with 1.x but will move on to 2.x, so suggestions for both are
> welcome.
>
> Thanks,
>
>
> [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=31820383
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
