>> Actually a custom plugin wouldn't work in this instance because it
>> wouldn't affect the document boost score.

Is that true? Doesn't the indexerScore method in scoring-opic affect the index
boost score? And don't methods like updateDbScore update the datum score?

-D.

-----Original Message-----
From: Dennis Kubes [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, July 23, 2008 11:36 AM
To: [email protected]
Subject: Re: Dedup Question

Actually a custom plugin wouldn't work in this instance because it 
wouldn't affect the document boost score.  You would need to operate on 
the crawldb directly or have a different indexer.  I will send you a 
hacked out ArbitraryIndexer that uses RPN to arbitrarily boost scores.

That being said I have completed work on a new scoring and indexing 
framework which stabilizes link scores and makes indexing much more 
flexible.  That should be released very soon.

Dennis
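
To make the RPN idea above concrete, here is a minimal sketch of boosting a
score for one domain with a reverse Polish notation expression. This is not
the ArbitraryIndexer mentioned above (that code is not shown in this thread)
and it does not use any Nutch API; the names eval_rpn, boosted_score, the
rule format and the "score" variable are all hypothetical.

```python
import operator
from urllib.parse import urlsplit

# Binary operators supported by this toy evaluator (an assumption, not Nutch's).
_OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def eval_rpn(expression, variables):
    """Evaluate a space-separated RPN expression.

    Tokens are operators, names looked up in `variables`, or numeric literals.
    """
    stack = []
    for token in expression.split():
        if token in _OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(_OPS[token](left, right))
        elif token in variables:
            stack.append(float(variables[token]))
        else:
            stack.append(float(token))
    if len(stack) != 1:
        raise ValueError("malformed RPN expression: " + expression)
    return stack[0]


def boosted_score(url, score, rules):
    """Apply the first rule whose domain matches the URL's host.

    `rules` is a hypothetical list of (domain, rpn_expression) pairs.
    URLs that match no rule keep their original score.
    """
    host = urlsplit(url).hostname or ""
    for domain, expression in rules:
        if host == domain or host.endswith("." + domain):
            return eval_rpn(expression, {"score": score})
    return score
```

With rules = [("example.com", "score 2 *")], a page on www.example.com has
its score doubled while pages on other hosts are left alone. Where such a
function would be called (crawldb update or indexing) is the open question
in this thread and is not settled by the sketch.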

Patrick Markiewicz wrote:
> Is there a way to configure nutch's scoring-opic plugin to bump up the
> score of a particular domain?  Or does it require a custom scoring
> plugin to do so?  Thanks.
> 
> Patrick
> 
> -----Original Message-----
> From: Dennis Kubes [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, July 23, 2008 10:55 AM
> To: [email protected]
> Subject: Re: Dedup Question
> 
> It will remove the one with the lowest score in the crawldb as set by
> the scoring filters.  Dedup first removes by url then by content hash.
> If the content is changed even slightly though it will *not* be detected
> as a duplicate.  Solving that problem is called near duplicate detection
> (ndd) and uses an algorithm called shingling which isn't currently
> implemented in Nutch (but hopefully will be in the near future).
> 
> Dennis
> 
> Patrick Markiewicz wrote:
>> Hi,
>>
>> If I have a url http://www.example.com/index.html stored in
>> my index with the content: EMPTY FILE, and I have a file
>> http://www.domain.com/index.html with the content: EMPTY FILE, then the
>> two files are duplicates.  Which one will the de-duplication process
>> remove from the index?  Thanks.
>>
>>  
>>
>> Patrick
>>
>>

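The dedup behaviour described in this thread (remove by url, then by content
hash, dropping the lower-scored entry) can be sketched roughly as follows.
This is an illustration only, not Nutch's code: the Doc record and function
names are hypothetical, SHA-256 stands in for whatever signature Nutch
computes, and applying the same keep-the-higher-score rule in both passes is
an assumption made for simplicity.

```python
import hashlib
from collections import namedtuple

# Hypothetical record: one indexed page with its crawldb score.
Doc = namedtuple("Doc", ["url", "content", "score"])


def content_hash(doc):
    """Exact-match digest of the content; any change yields a different hash."""
    return hashlib.sha256(doc.content.encode("utf-8")).hexdigest()


def _keep_best(docs, key):
    """For each key value keep only the highest-scored document."""
    best = {}
    for doc in docs:
        k = key(doc)
        if k not in best or doc.score > best[k].score:
            best[k] = doc
    return list(best.values())


def dedup(docs):
    """First pass: duplicates by URL. Second pass: duplicates by content hash."""
    by_url = _keep_best(docs, key=lambda d: d.url)
    return _keep_best(by_url, key=content_hash)
```

For the two EMPTY FILE pages in the original question, the entry with the
lower score is the one dropped. If one of the pages had even a single extra
character, the hashes would differ and both would survive, which is the
near-duplicate case that exact hashing cannot catch.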