Would stuff like this make a good candidate for a "Nutch API" for common processes/jobs?
Or pull out the calculations into some sort of config file/mathML file or something? (assuming much of the stuff is using some kind of math of sorts..)

-----Original Message-----
From: Andrzej Bialecki <[EMAIL PROTECTED]>
To: [email protected]
Date: Mon, 18 Apr 2005 19:39:18 +0200
Subject: Re: [Nutch-general] Re: Index merging

> The de-duplication algorithm should be abstracted and separated into a
> utility method/class - currently both DeleteDuplicates and
> SegmentMergeTool perform de-duplication, but I'm afraid that each
> follows a slightly different, hardcoded routine...
>
> SegmentMergeTool uses a simpler algorithm: for all documents with equal
> url hash or content hash, keep only the latest document. Neither link
> scores nor url length is taken into account... :-(
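
To make the simpler SegmentMergeTool rule quoted above concrete, here is a rough Java sketch of what a shared utility might look like. This is not the actual Nutch code: DedupSketch, Doc, keepLatest and the three fields are all made-up names for illustration, and treating "latest" as "largest fetch time" is an assumption. It also assumes that only the hashes of documents that survive are remembered, which is one possible reading of the rule.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch only; none of these names come from the Nutch code base.
final class DedupSketch {

    // Minimal stand-in for an indexed document (assumed fields).
    static final class Doc {
        final String urlHash;
        final String contentHash;
        final long fetchTime; // assumption: larger value = fetched more recently

        Doc(String urlHash, String contentHash, long fetchTime) {
            this.urlHash = urlHash;
            this.contentHash = contentHash;
            this.fetchTime = fetchTime;
        }
    }

    // Keeps a document only if no newer kept document shares its
    // url hash or its content hash. Link scores and url length are ignored.
    static List<Doc> keepLatest(List<Doc> docs) {
        // Work on a copy, newest first, so the first document seen wins.
        List<Doc> sorted = new ArrayList<>(docs);
        sorted.sort(Comparator.comparingLong((Doc d) -> d.fetchTime).reversed());

        Set<String> seenUrlHashes = new HashSet<>();
        Set<String> seenContentHashes = new HashSet<>();
        List<Doc> kept = new ArrayList<>();

        for (Doc d : sorted) {
            boolean duplicate = seenUrlHashes.contains(d.urlHash)
                    || seenContentHashes.contains(d.contentHash);
            if (duplicate) {
                continue; // an equal-hash document that is newer was already kept
            }
            // Assumption: only hashes of kept documents are recorded.
            seenUrlHashes.add(d.urlHash);
            seenContentHashes.add(d.contentHash);
            kept.add(d);
        }
        return kept;
    }
}
```

If both DeleteDuplicates and SegmentMergeTool called one method like this, the "which duplicate wins" rule would live in a single place, and the comparator could later be swapped for one that also looks at link scores or url length.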
