If I understand your question correctly: the getRecordReader method of the InputFormat inner class in DeleteDuplicates is where the hash score is read from the document. You could put your algorithm there and return a numeric value computed from the document fields. You would need to write a replacement class for HashScore and return it from the record reader. You would probably want to keep IndexDoc as the value written out in dedup phase 1 (in the job config) but change the key to your HashScore replacement class. You would also need to change HashPartitioner to partition on the numeric value in your new key, and change HashReducer to collect only the documents you want based on that same value.
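
To make the idea concrete, here is a rough sketch of the three pieces (key, partitioner, reducer) in plain Java. It deliberately does not use the Hadoop or Nutch classes, so every name in it (Doc, SimilarityKey, partition, reduce) is hypothetical, and the "similarity" measure is only a placeholder (same set of lowercased words) that you would swap for your own algorithm. It also assumes you want to keep the highest-scoring document in each group.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

public class SimilarityDedupSketch {

    /** Hypothetical stand-in for IndexDoc: only the fields this sketch needs. */
    static final class Doc {
        final String url;
        final String text;
        final float score;

        Doc(String url, String text, float score) {
            this.url = url;
            this.text = text;
            this.score = score;
        }
    }

    /** Hypothetical HashScore replacement: one numeric value per document. */
    static final class SimilarityKey implements Comparable<SimilarityKey> {
        final long signature;

        SimilarityKey(long signature) {
            this.signature = signature;
        }

        /** Placeholder analysis: documents with the same set of words get the same key. */
        static SimilarityKey of(Doc doc) {
            TreeSet<String> words = new TreeSet<>();
            for (String w : doc.text.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!w.isEmpty()) {
                    words.add(w);
                }
            }
            return new SimilarityKey(words.hashCode());
        }

        @Override
        public int compareTo(SimilarityKey other) {
            return Long.compare(signature, other.signature);
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof SimilarityKey && ((SimilarityKey) o).signature == signature;
        }

        @Override
        public int hashCode() {
            return Long.hashCode(signature);
        }
    }

    /** Partition on the numeric key so that equal keys always land in the same partition. */
    static int partition(SimilarityKey key, int numPartitions) {
        return (int) ((key.signature & Long.MAX_VALUE) % numPartitions);
    }

    /** Reduce step: for each key, keep only the highest-scoring document. */
    static List<Doc> reduce(List<Doc> docs) {
        Map<SimilarityKey, Doc> best = new LinkedHashMap<>();
        for (Doc doc : docs) {
            SimilarityKey key = SimilarityKey.of(doc);
            Doc current = best.get(key);
            if (current == null || doc.score > current.score) {
                best.put(key, doc);
            }
        }
        return new ArrayList<>(best.values());
    }
}
```

In the real job the grouping by key is done for you by the framework between the map and reduce steps; the map inside reduce() above only simulates that so the sketch can run on its own.
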
Dedup phase 2 deletes by URL, so if you want to remove documents with identical URLs, leave it in; otherwise you may want to take the job config section for phase 2 out.

Hope this helps.

Dennis

sdeck wrote:
> Hello,
> I am running nutch .8 against hadoop .4, just for reference
> I want to add a delete duplicate based on a similarity algorithm, as opposed
> to the hash method that is currently in there.
> I would have to say I am pretty lost as to how the delete duplicates class
> is working.
> I would guess that I need to implement a compareTo method, but I am not
> really sure what to return. Also, when I do return something, where do I
> implement the functionality to say "yes, these are dupes, so remove the
> first one)
>
> Can anyone help out?
> Thanks,
> S
