If I understand your question correctly, the place to start is the
getRecordReader method of the InputFormat inner class in
DeleteDuplicates, which is where the hash score is read from the
document.  You could put your algorithm there and return a numeric
value based on an analysis of the document fields.  That means writing
a replacement class for HashScore and returning it from the record
reader.  You would probably want to keep IndexDoc as the value written
out in dedup phase 1 (set in the job config) but change the key to your
HashScore replacement class.  HashPartitioner would then need to
partition on the numeric value in your new key, and HashReducer would
need to collect only the documents you want, again based on that
numeric value.
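
To make the moving parts above a bit more concrete, here is a rough,
standalone sketch in plain Java.  It deliberately does not use the
Hadoop or Nutch classes, so every name in it (SimilarityKey, Doc,
bucketFor, partitionFor, duplicatesToDelete) is made up for
illustration and is not part of DeleteDuplicates.  The "similarity"
function is only a toy word fingerprint standing in for whatever
algorithm you actually want to use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SimilarityDedupSketch {

    /** Hypothetical stand-in for a HashScore replacement: a numeric key plus a score. */
    static final class SimilarityKey implements Comparable<SimilarityKey> {
        final long bucket;  // numeric value computed from the document fields
        final float score;  // used to decide which duplicate survives

        SimilarityKey(long bucket, float score) {
            this.bucket = bucket;
            this.score = score;
        }

        /** Sort by bucket first, then by score descending so the best doc comes first. */
        @Override
        public int compareTo(SimilarityKey other) {
            int c = Long.compare(bucket, other.bucket);
            return c != 0 ? c : Float.compare(other.score, score);
        }
    }

    /** Hypothetical stand-in for the value emitted next to the key (IndexDoc in the real job). */
    static final class Doc {
        final String url;
        final SimilarityKey key;

        Doc(String url, String text, float score) {
            this.url = url;
            this.key = new SimilarityKey(bucketFor(text), score);
        }
    }

    /** Toy analysis (an assumption, not a real similarity measure): order-insensitive sum of word hashes. */
    static long bucketFor(String text) {
        long sum = 0;
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                sum += word.hashCode();
            }
        }
        return sum;
    }

    /** What a partitioner would do: send equal buckets to the same partition. */
    static int partitionFor(SimilarityKey key, int numPartitions) {
        return (int) Math.floorMod(key.bucket, (long) numPartitions);
    }

    /** What a reducer would do: keep the first doc per bucket, report the rest for deletion. */
    static List<String> duplicatesToDelete(List<Doc> docs) {
        List<Doc> sorted = new ArrayList<>(docs);
        sorted.sort((a, b) -> a.key.compareTo(b.key));
        List<String> toDelete = new ArrayList<>();
        Long currentBucket = null;
        for (Doc doc : sorted) {
            if (currentBucket != null && currentBucket.longValue() == doc.key.bucket) {
                toDelete.add(doc.url);          // same bucket as a better-scored doc
            } else {
                currentBucket = doc.key.bucket; // first (highest-scored) doc in this bucket is kept
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
            new Doc("http://a/", "Hello World", 1.0f),
            new Doc("http://b/", "world hello", 2.0f),
            new Doc("http://c/", "something else", 0.5f));
        System.out.println("delete: " + duplicatesToDelete(docs));
    }
}
```

The point of the sketch is only the division of labour: compareTo just
orders the keys, the partitioner keeps equal numeric values together,
and the decision "these are dupes, drop all but one" lives in the
reducer step.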

Dedup phase 2 deletes by URL.  If you still want to remove documents
with exactly the same URL, leave it in; otherwise you may want to take
the job config section for phase 2 out.

Hope this helps.

Dennis

sdeck wrote:
> Hello,
>   I am running Nutch 0.8 against Hadoop 0.4, just for reference.
> I want to add a delete duplicate based on a similarity algorithm, as opposed
> to the hash method that is currently in there.
> I would have to say I am pretty lost as to how the delete duplicates class
> is working.
> I would guess that I need to implement a compareTo method, but I am not
> really sure what to return. Also, when I do return something, where do I
> implement the functionality to say "yes, these are dupes, so remove the
> first one"?
>
> Can anyone help out?
> Thanks,
> S
>   
