Dennis Kubes
Wed, 23 Jul 2008 08:38:50 -0700
That being said, I have completed work on a new scoring and indexing framework that stabilizes link scores and makes indexing much more flexible. It should be released very soon.
Dennis

Patrick Markiewicz wrote:
Is there a way to configure nutch's scoring-opic plugin to bump up the score of a particular domain? Or does it require a custom scoring plugin to do so?

Thanks.
Patrick

-----Original Message-----
From: Dennis Kubes [EMAIL PROTECTED]
Sent: Wednesday, July 23, 2008 10:55 AM
To: nutch-user@lucene.apache.org
Subject: Re: Dedup Question

It will remove the one with the lowest score in the crawldb as set by the scoring filters. Dedup first removes by url then by content hash. If the content is changed even slightly though it will *not* be detected as a duplicate. Solving that problem is called near duplicate detection (ndd) and uses an algorithm called shingling which isn't currently implemented in Nutch (but hopefully will be in the near future).

Dennis

Patrick Markiewicz wrote:
Hi,

If I have a url http://www.example.com/index.html stored in my index with the content: EMPTY FILE, and I have a file http://www.domain.com/index.html with the content: EMPTY FILE, then the two files are duplicates. Which one will the de-duplication process remove from the index?

Thanks.
Patrick
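
For anyone reading this thread later, the dedup behaviour described in the quoted reply can be sketched roughly as below. This is only an illustration in Python, not Nutch's code: the record layout, the function names (dedup, shingles, jaccard), the choice of MD5 and the scores are all assumptions made for the example. The reply only states the "lowest score is removed" rule; applying that same rule to the URL pass is a simplification here. The second half shows the general idea of word shingling for near duplicate detection, which the reply says Nutch does not currently implement.

```python
import hashlib


def dedup(docs):
    """Exact dedup: first by URL, then by content hash.

    docs is a list of dicts with 'url', 'content' and 'score' keys
    (a hypothetical layout, not Nutch's crawldb format).
    In each pass the highest-scoring entry survives (assumption for the URL pass).
    """
    # Pass 1: keep one entry per URL.
    by_url = {}
    for d in docs:
        kept = by_url.get(d["url"])
        if kept is None or d["score"] > kept["score"]:
            by_url[d["url"]] = d

    # Pass 2: keep one entry per content hash (MD5 chosen only for illustration).
    by_hash = {}
    for d in by_url.values():
        h = hashlib.md5(d["content"].encode("utf-8")).hexdigest()
        kept = by_hash.get(h)
        if kept is None or d["score"] > kept["score"]:
            by_hash[h] = d
    return list(by_hash.values())


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) of text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of the shingle sets of two texts (1.0 = same shingles)."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)
```

With the two URLs from the original question and made-up scores of 1.0 for www.example.com and 2.0 for www.domain.com, dedup() keeps only the www.domain.com entry. Two pages that differ by a single word hash differently, so both survive dedup(), while jaccard() still reports a substantial overlap; a near-duplicate detector would compare that value against some chosen threshold.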