Is there a way to configure Nutch's scoring-opic plugin to bump up the
score of a particular domain?  Or does it require a custom scoring
plugin to do so?  Thanks.

Patrick

-----Original Message-----
From: Dennis Kubes [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, July 23, 2008 10:55 AM
To: [email protected]
Subject: Re: Dedup Question

It will remove the one with the lowest score in the crawldb as set by
the scoring filters.  Dedup first removes by url then by content hash.
If the content is changed even slightly though it will *not* be detected
as a duplicate.  Solving that problem is called near duplicate detection
(ndd) and uses an algorithm called shingling which isn't currently
implemented in Nutch (but hopefully will be in the near future).
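
To make the two passes concrete, here is a rough Python sketch of the
idea.  It is not Nutch's code: the record layout, the function names
(dedup, shingles, jaccard), the use of MD5 as the content hash, and the
scores are all assumptions for illustration.  The thread does not say how
entries sharing a URL are ranked, so the sketch simply reuses the score
there as well.  The last two functions show word-level shingling with
Jaccard similarity, one common way to approach near duplicate detection.

```python
import hashlib


def dedup(records):
    """Keep one record per URL, then one per content hash.

    Each record is a dict with 'url', 'content' and 'score' keys
    (a hypothetical layout, not the crawldb format).
    """
    # Pass 1: collapse entries with the same URL.
    # Assumption: the higher score wins; the thread does not specify this.
    by_url = {}
    for rec in records:
        kept = by_url.get(rec["url"])
        if kept is None or rec["score"] > kept["score"]:
            by_url[rec["url"]] = rec

    # Pass 2: collapse entries whose content hashes are identical,
    # dropping the one with the lower score.
    by_hash = {}
    for rec in by_url.values():
        digest = hashlib.md5(rec["content"].encode("utf-8")).hexdigest()
        kept = by_hash.get(digest)
        if kept is None or rec["score"] > kept["score"]:
            by_hash[digest] = rec

    return sorted(by_hash.values(), key=lambda r: r["url"])


def shingles(text, k=3):
    """Return the set of k-word shingles (as tuples) in text."""
    words = text.split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of the k-word shingle sets of two texts."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    pages = [
        # Scores are made up for the example.
        {"url": "http://www.example.com/index.html",
         "content": "EMPTY FILE", "score": 1.0},
        {"url": "http://www.domain.com/index.html",
         "content": "EMPTY FILE", "score": 2.0},
    ]
    for rec in dedup(pages):
        print(rec["url"], rec["score"])

    print(jaccard("the quick brown fox jumps over the lazy dog",
                  "the quick brown fox jumps over the lazy cat"))
```

With the made-up scores above, only the higher-scored page survives the
hash pass.  Changing a single character in one page's content gives a
different hash, so both pages would be kept; a shingle-based similarity
above some chosen threshold is what would flag that pair instead.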

Dennis

Patrick Markiewicz wrote:
> Hi,
> 
> If I have a url http://www.example.com/index.html stored in
> my index with the content: EMPTY FILE, and I have a file
> http://www.domain.com/index.html with the content: EMPTY FILE, then the
> two files are duplicates.  Which one will the de-duplication process
> remove from the index?  Thanks.
> 
>  
> 
> Patrick
> 
> 
