>> Actually a custom plugin wouldn't work in this instance because it wouldn't affect the document boost score.
Is that true? Doesn't the indexerScore method in scoring-opic affect the index boost score? Or methods like updateDbScore, which update the datum score?

-D.

-----Original Message-----
From: Dennis Kubes [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 23, 2008 11:36 AM
To: [email protected]
Subject: Re: Dedup Question

Actually a custom plugin wouldn't work in this instance because it wouldn't affect the document boost score. You would need to operate on the crawldb directly or have a different indexer. I will send you a hacked out ArbitraryIndexer that uses RPN to arbitrarily boost scores.

That being said I have completed work on a new scoring and indexing framework which stabilizes link scores and makes indexing much more flexible. That should be released very soon.

Dennis

Patrick Markiewicz wrote:
> Is there a way to configure nutch's scoring-opic plugin to bump up the
> score of a particular domain? Or does it require a custom scoring
> plugin to do so? Thanks.
>
> Patrick
>
> -----Original Message-----
> From: Dennis Kubes [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 23, 2008 10:55 AM
> To: [email protected]
> Subject: Re: Dedup Question
>
> It will remove the one with the lowest score in the crawldb as set by
> the scoring filters. Dedup first removes by url then by content hash.
> If the content is changed even slightly though it will *not* be detected
> as a duplicate. Solving that problem is called near duplicate detection
> (ndd) and uses an algorithm called shingling which isn't currently
> implemented in Nutch (but hopefully will be in the near future).
>
> Dennis
>
> Patrick Markiewicz wrote:
>> Hi,
>>
>> If I have a url http://www.example.com/index.html stored in
>> my index with the content: EMPTY FILE, and I have a file
>> http://www.domain.com/index.html with the content: EMPTY FILE, then the
>> two files are duplicates. Which one will the de-duplication process
>> remove from the index? Thanks.
>>
>> Patrick
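
To make the dedup behaviour described in the thread concrete, here is a minimal Python sketch. It is not Nutch code: the function names, the choice of MD5 as the content hash, the scores, the tie-breaking rule, and the shingle size k=3 are all assumptions made for illustration. It shows (1) keeping only the highest-scoring document per content hash, and (2) why a small edit defeats an exact hash while word shingles still report high similarity.

```python
import hashlib


def content_hash(text):
    # Exact-match signature of the content (MD5 is an assumption here).
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def dedup_by_hash(docs):
    """Keep the highest-scoring document for each content hash.

    docs is a list of (url, content, score) tuples; the scores are
    hypothetical stand-ins for whatever the scoring filters produced.
    On a tie the document seen first is kept (an arbitrary choice).
    """
    best = {}
    for url, content, score in docs:
        key = content_hash(content)
        if key not in best or score > best[key][2]:
            best[key] = (url, content, score)
    return list(best.values())


def shingles(text, k=3):
    # Set of overlapping k-word windows ("shingles") over the text.
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    # Jaccard similarity of the two shingle sets, between 0.0 and 1.0.
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    docs = [
        ("http://www.example.com/index.html", "EMPTY FILE", 1.0),
        ("http://www.domain.com/index.html", "EMPTY FILE", 2.0),
    ]
    # Only the higher-scoring of the two identical pages survives.
    print(dedup_by_hash(docs))

    a = "the quick brown fox jumps over the lazy dog"
    b = "the quick brown fox jumps over the lazy cat"
    # Exact hashes differ, yet the shingle sets mostly overlap.
    print(content_hash(a) == content_hash(b), jaccard(a, b))
```

With the hypothetical scores above, the www.domain.com copy is kept because it scores higher. The second pair differs by one word, so exact hashing treats the two texts as distinct, while the shingle overlap stays high; a near-duplicate detector would compare that overlap against a threshold.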
