Re: [Nutch-dev] Creating a new scoring filter

Lorenzo Tue, 24 Apr 2007 14:06:32 -0700

Very briefly: an HtmlParseFilter plus a list of weighted words.
The filter examines the parse text and adds a boost value if it finds
one of the words in the list.
This boost value is stored in the ParseData metadata.
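
As a rough illustration of the filter side, here is a minimal plain-Java
sketch of computing a boost from a list of weighted words. It does not use
the Nutch API: the class name WeightedWordBoost, the method computeBoost and
the sample words are hypothetical, and summing the weights of all matching
words is an assumption (the plugin may combine them differently).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Plain-Java sketch; names are hypothetical and not part of Nutch. */
class WeightedWordBoost {

    /** Sums the weight of every listed word that occurs at least once in the text. */
    static float computeBoost(String parseText, Map<String, Float> weightedWords) {
        // Tokenize case-insensitively on anything that is not a letter or digit.
        String[] tokens = parseText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        Set<String> seen = new HashSet<>(Arrays.asList(tokens));

        float boost = 0.0f;
        for (Map.Entry<String, Float> entry : weightedWords.entrySet()) {
            // Assumption: each matching word counts once, however often it occurs.
            if (seen.contains(entry.getKey().toLowerCase(Locale.ROOT))) {
                boost += entry.getValue();
            }
        }
        return boost;
    }

    public static void main(String[] args) {
        Map<String, Float> words = new LinkedHashMap<>();
        words.put("crawler", 2.0f);
        words.put("lucene", 1.5f);
        words.put("hadoop", 1.0f);

        String text = "A Lucene-based crawler. The crawler fetches pages.";
        System.out.println(computeBoost(text, words));
    }
}
```

In the real plugin the resulting value would be written into the parse
metadata rather than printed.
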
Then a scoring plugin reads this metadata (in passScoreAfterParsing) and
updates the CrawlDatum, both of the outlinked pages (to focus the search
more tightly) and of the current page (the difficult part, as explained on
the mailing list; however, with NUTCH-468 it should be easier now).
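
For the current-page part, the approach discussed in the quoted thread below
is to flag the adjust datum in its metadata and pick it out again in
updateDbScore. A minimal plain-Java sketch of that idea follows. The Datum
class is a simplified stand-in for CrawlDatum, the key "focus.boost" is
hypothetical, and adding up the inlink scores is an assumption about how the
scores are combined, not the plugin's actual code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch; Datum is a stand-in, not the Nutch CrawlDatum class. */
class FlaggedAdjustDemo {

    /** Hypothetical metadata key that marks the adjust datum. */
    static final String BOOST_KEY = "focus.boost";

    /** Simplified datum: a score plus string-keyed metadata. */
    static class Datum {
        float score;
        final Map<String, Float> meta = new HashMap<>();

        Datum(float score) {
            this.score = score;
        }
    }

    /** Builds the adjust datum and flags it by storing the boost in its metadata. */
    static Datum makeAdjust(float boostValue) {
        Datum adjust = new Datum(0.0f);
        adjust.meta.put(BOOST_KEY, boostValue);
        return adjust;
    }

    /** Flagged entries contribute their boost; ordinary inlinks contribute their score. */
    static void updateDbScore(Datum old, Datum datum, List<Datum> inlinked) {
        float total = (old != null) ? old.score : 0.0f;
        for (Datum in : inlinked) {
            Float boost = in.meta.get(BOOST_KEY);
            if (boost != null) {
                total += boost;      // the flagged adjust datum
            } else {
                total += in.score;   // a regular inlink contribution
            }
        }
        datum.score = total;
    }

    public static void main(String[] args) {
        List<Datum> inlinked = new ArrayList<>();
        inlinked.add(new Datum(0.25f));
        inlinked.add(new Datum(0.5f));
        inlinked.add(makeAdjust(2.0f));

        Datum updated = new Datum(0.0f);
        updateDbScore(new Datum(1.0f), updated, inlinked);
        System.out.println(updated.score);
    }
}
```

Because the flag lives in the metadata, there is no need to store the URL
and compare it against each entry of the list.
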


If you need any other information, please ask!

Lorenzo


Briggs wrote:
> Yes.  I too need to alter the score based on attributes and such of
> the particular url passed.
> May I ask what you have done?
>
>
> On 4/22/07, Lorenzo <[EMAIL PROTECTED]> wrote:
>> Perfect! Now I have it working, and it performs quite well for a focused
>> search engine like ours!
>> Do you think it could be an interesting plug-in to add to Nutch?
>>
>> Lorenzo
>>
>>
>> Doğacan Güney wrote:
>> > On 4/21/07, Lorenzo <[EMAIL PROTECTED]> wrote:
>> >>
>> >> Uhmm... so, suppose I decided, from its content, that the current page
>> >> http://foo/bar.htm is really desirable.
>> >> I have put a flag in ParseData's metadata to mark it.
>> >> In distributeScoreToOutlink(s) I read it from the ParseData param and
>> >> put it in the adjust CrawlDatum's metadata:
>> >>
>> >>       MapWritable adjustMap = adjust.getMetaData();
>> >>       adjustMap.put(key, new FloatWritable(bootsValue));
>> >>       return adjust;
>> >>
>> >> So in updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List
>> >> inlinked)
>> >> the adjust CrawlDatum will be among the entries of the inlinked list.
>> >> Is that right? How do I distinguish it?
>> >> I can put the URL in the metadata too and scan through the list, but
>> >> maybe there is a better method?
>> >
>> >
>> >
>> > Your approach is the best one: put a flag in the adjust datum's
>> > metadata to mark it, then process it in updateDbScore.
>> >
>> >> Also, will this CrawlDatum be the same one that is passed to indexerScore?
>> >
>> >
>> > You get two CrawlDatums in indexerScore. The first is fetchDatum, the
>> > one in crawl_fetch that contains the fetching status. The second is
>> > dbDatum, which comes from the crawldb. This dbDatum is the one that you
>> > set in updateDbScore (the 'datum' argument of updateDbScore).
>> >
>> >
>> >> Thanks a lot!
>> >>
>> >> Lorenzo
>> >>
>> >>
>> >
>> >
>>
>>
>
>



