If I am understanding what you are asking: in the getRecordReader method
of the InputFormat inner class in DeleteDuplicates, it gets the hash
score from the document. You could put your algorithm there and return
some numeric value based on an analysis of the document fields.
You would need to write a replacement class for HashScore and return it
from the record reader. You would probably want to keep the IndexDoc
being written out as the value in dedup phase 1 (in the job config), but
change the key to your HashScore replacement class. You would also need
to change HashPartitioner to partition according to your new numeric
key, and to change HashReducer so it collects only the documents you
want based on that key.
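To make the key/partitioner idea concrete, here is a rough sketch of what a replacement key class might look like. All the names here (SimilarityKey, partitionFor, the 0-to-1 score) are hypothetical, and this is plain Java rather than the real Hadoop types: in Nutch you would implement org.apache.hadoop.io.WritableComparable instead of Comparable, adding write(DataOutput) and readFields(DataInput) so the key can be serialized between the map and reduce phases.

```java
// Hypothetical sketch of a similarity-score key to replace the hash key
// in dedup phase 1. In real Nutch/Hadoop code this would implement
// WritableComparable, not just Comparable.
public class SimilarityKey implements Comparable<SimilarityKey> {
  private final float score;   // output of your similarity algorithm

  public SimilarityKey(float score) { this.score = score; }

  public float getScore() { return score; }

  // compareTo defines the sort order the reducer sees the keys in;
  // here, ascending by score.
  @Override
  public int compareTo(SimilarityKey other) {
    return Float.compare(this.score, other.score);
  }

  @Override
  public int hashCode() { return Float.floatToIntBits(score); }

  @Override
  public boolean equals(Object o) {
    return o instanceof SimilarityKey
        && ((SimilarityKey) o).score == this.score;
  }

  // What a replacement for HashPartitioner would do: route keys to
  // reducers so that keys you consider "the same" land on the same
  // reducer. Bucketing the score coarsely means near-equal scores
  // end up together and can be compared in the reducer.
  public static int partitionFor(SimilarityKey key, int numReducers) {
    int bucket = (int) (key.getScore() * 100);   // coarse bucket
    return (bucket & Integer.MAX_VALUE) % numReducers;
  }
}
```

The reducer then sees all documents that share a key (or a bucket) grouped together; that is the place where you decide which candidate duplicates to keep and collect only those.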
Dedup phase 2 deletes by URL, so if you also want to remove exact-URL
duplicates you would leave it in; otherwise you might want to take the
job config section for phase 2 out.
Hope this helps.
Dennis
sdeck wrote:
Hello,
I am running Nutch 0.8 against Hadoop 0.4, just for reference.
I want to add duplicate deletion based on a similarity algorithm, as
opposed to the hash method that is currently in there.
I would have to say I am pretty lost as to how the delete duplicates
class works.
I would guess that I need to implement a compareTo method, but I am not
really sure what to return. Also, once I do return something, where do I
implement the functionality that says "yes, these are dupes, so remove
the first one"?
Can anyone help out?
Thanks,
S