Would something like this make a good candidate for a "Nutch API" covering
common processes/jobs?

Or could the calculations be pulled out into some sort of config file or
MathML file? (Assuming much of this relies on some kind of math.)

-----Original Message-----
From: Andrzej Bialecki <[EMAIL PROTECTED]>
To: [email protected]
Date: Mon, 18 Apr 2005 19:39:18 +0200
Subject: Re: [Nutch-general] Re: Index merging

>
> 
> The de-duplication algorithm should be abstracted and separated into a 
> utility method/class - currently both DeleteDuplicates and 
> SegmentMergeTool perform de-duplication, but I'm afraid that each 
> follows a slightly different, hardcoded routine...
> 
> SegmentMergeTool uses a simpler algorithm: for all documents with equal
> url hash or content hash, keep only the latest document. Neither link 
> scores nor url length is taken into account... :-(
> 
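
For illustration, the simpler rule quoted above (equal URL hash or content
hash, keep only the latest document) might be factored into a shared utility
roughly like this. It is only a sketch under assumptions, not the actual
SegmentMergeTool or DeleteDuplicates code: the names KeepLatestDedup, Doc,
urlHash, contentHash and fetchTime are hypothetical, and running the URL pass
before the content pass is my own choice, since the quoted description does
not say how the two checks are combined.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical shared de-duplication helper; not part of Nutch. */
final class KeepLatestDedup {

    /** Hypothetical document record holding only the fields the rule needs. */
    static final class Doc {
        public final String urlHash;
        public final String contentHash;
        public final long fetchTime;

        Doc(String urlHash, String contentHash, long fetchTime) {
            this.urlHash = urlHash;
            this.contentHash = contentHash;
            this.fetchTime = fetchTime;
        }
    }

    /** Keeps, for each key value, the document with the greatest fetchTime. */
    private static List<Doc> latestPerKey(Collection<Doc> docs,
                                          Function<Doc, String> key) {
        // LinkedHashMap preserves first-seen key order, so output is stable.
        Map<String, Doc> latest = new LinkedHashMap<>();
        for (Doc d : docs) {
            // On a tie in fetchTime the document seen first is kept.
            latest.merge(key.apply(d), d,
                    (kept, next) -> next.fetchTime > kept.fetchTime ? next : kept);
        }
        return new ArrayList<>(latest.values());
    }

    /**
     * Assumption: de-duplicate by URL hash first, then by content hash.
     * Link scores and URL length are deliberately ignored, as in the
     * simpler algorithm described above.
     */
    static List<Doc> dedup(Collection<Doc> docs) {
        List<Doc> uniqueUrls = latestPerKey(docs, d -> d.urlHash);
        return latestPerKey(uniqueUrls, d -> d.contentHash);
    }
}
```

With something like this in one place, both tools could call the same method,
and a variant that also weighs link scores or URL length could be swapped in
by replacing the comparison inside merge().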
