Would stuff like this make a good candidate for a "Nutch API" for common processes/jobs?
Or pull out the calculations into some sort of config file/mathML file or something? (assuming much of the stuff is using some kind of math of sorts..)

-----Original Message-----
From: Andrzej Bialecki <[EMAIL PROTECTED]>
To: [email protected]
Date: Mon, 18 Apr 2005 19:39:18 +0200
Subject: Re: [Nutch-general] Re: Index merging

> The de-duplication algorithm should be abstracted and separated into a
> utility method/class - currently both DeleteDuplicates and
> SegmentMergeTool perform de-duplication, but I'm afraid that each
> follows a slightly different, hardcoded routine...
>
> SegmentMergeTool uses a simpler algorithm: for all documents with equal
> url hash or content hash, keep only the latest document. Neither link
> scores nor url length is taken into account... :-(
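
To make the idea of a shared de-duplication utility a bit more concrete, here is a rough sketch of the simpler rule described in the quoted message (for documents with an equal url hash or content hash, keep only the latest one). This is not actual Nutch code: the names DedupSketch, Doc, keepBestPerKey and LATEST are made up for illustration, and applying the url-hash pass before the content-hash pass is my own assumption about how "or" could be handled. The preference rule is passed in as a Comparator, so a different routine could be plugged in without touching the grouping logic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch only; none of these names exist in Nutch.
final class DedupSketch {

    // Minimal stand-in for an indexed document (assumed fields).
    static final class Doc {
        final String urlHash;
        final String contentHash;
        final long fetchTime;

        Doc(String urlHash, String contentHash, long fetchTime) {
            this.urlHash = urlHash;
            this.contentHash = contentHash;
            this.fetchTime = fetchTime;
        }
    }

    // "Keep only the latest document": a larger fetchTime is preferred.
    static final Comparator<Doc> LATEST = Comparator.comparingLong(d -> d.fetchTime);

    // For each distinct key, keep the single document the comparator prefers.
    // On a tie, the document seen first is kept.
    static List<Doc> keepBestPerKey(Collection<Doc> docs,
                                    Function<Doc, String> key,
                                    Comparator<Doc> preference) {
        Map<String, Doc> best = new LinkedHashMap<>();
        for (Doc d : docs) {
            best.merge(key.apply(d), d,
                       (old, cur) -> preference.compare(cur, old) > 0 ? cur : old);
        }
        return new ArrayList<>(best.values());
    }

    // Assumption: de-duplicate by url hash first, then by content hash.
    static List<Doc> dedup(Collection<Doc> docs, Comparator<Doc> preference) {
        List<Doc> byUrl = keepBestPerKey(docs, d -> d.urlHash, preference);
        return keepBestPerKey(byUrl, d -> d.contentHash, preference);
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
            new Doc("u1", "c1", 1L),   // older copy of the same url as the next doc
            new Doc("u1", "c2", 2L),
            new Doc("u2", "c2", 3L),   // same content as the previous doc, but newer
            new Doc("u3", "c3", 1L));
        for (Doc d : dedup(docs, LATEST)) {
            System.out.println(d.urlHash + " " + d.contentHash + " " + d.fetchTime);
        }
    }
}
```

A rule that also looks at link scores or url length would then just be another Comparator handed to dedup(), which is roughly the kind of abstraction being asked for in the quoted message.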
