If I am understanding what you are asking: in the getRecordReader method
of the InputFormat inner class in DeleteDuplicates, it gets the hash
score from the document. You could put your algorithm there and return
some numeric value based on an analysis of the document fields.
You would need to write a replacement class for HashScore and return it
from the record reader. You would probably want to keep the IndexDoc
being written out as the value in dedup phase 1 (in the job config), but
change the key to your HashScore replacement class. You would also need
to change HashPartitioner to partition according to your new numeric
key, and to change HashReducer so it collects only the documents you
want based on that key.
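To make the key/partitioner idea concrete, here is a rough sketch of what a replacement key class might look like. All the names here (SimilarityKey, partitionFor, the 0-to-1 score) are hypothetical, and this is plain Java rather than the real Hadoop types: in Nutch you would implement org.apache.hadoop.io.WritableComparable instead of Comparable, adding write(DataOutput) and readFields(DataInput) so the key can be serialized between the map and reduce phases.

```java
// Hypothetical sketch of a similarity-score key to replace the hash key
// in dedup phase 1. In real Nutch/Hadoop code this would implement
// WritableComparable, not just Comparable.
public class SimilarityKey implements Comparable<SimilarityKey> {
  private final float score;   // output of your similarity algorithm

  public SimilarityKey(float score) { this.score = score; }

  public float getScore() { return score; }

  // compareTo defines the sort order the reducer sees the keys in;
  // here, ascending by score.
  @Override
  public int compareTo(SimilarityKey other) {
    return Float.compare(this.score, other.score);
  }

  @Override
  public int hashCode() { return Float.floatToIntBits(score); }

  @Override
  public boolean equals(Object o) {
    return o instanceof SimilarityKey
        && ((SimilarityKey) o).score == this.score;
  }

  // What a replacement for HashPartitioner would do: route keys to
  // reducers so that keys you consider "the same" land on the same
  // reducer. Bucketing the score coarsely means near-equal scores
  // end up together and can be compared in the reducer.
  public static int partitionFor(SimilarityKey key, int numReducers) {
    int bucket = (int) (key.getScore() * 100);   // coarse bucket
    return (bucket & Integer.MAX_VALUE) % numReducers;
  }
}
```

The reducer then sees all documents that share a key (or a bucket) grouped together; that is the place where you decide which candidate duplicates to keep and collect only those.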
Dedup phase 2 deletes by URL, so if you also want to remove exact-URL
duplicates you would leave it in; otherwise you might want to take the
job config section for phase 2 out.
Hope this helps.
Dennis
sdeck wrote:
Hello,
I am running Nutch 0.8 against Hadoop 0.4, just for reference.
I want to add duplicate deletion based on a similarity algorithm, as
opposed to the hash method that is currently in there.
I would have to say I am pretty lost as to how the delete duplicates
class works.
I would guess that I need to implement a compareTo method, but I am not
really sure what to return. Also, once I do return something, where do I
implement the functionality that says "yes, these are dupes, so remove
the first one"?
Can anyone help out?
Thanks,
S