Some of it is happening behind the scenes. A hash of the text of the indexed document is created when the documents are read. The MapReduce process uses the InputFormat static inner class to read documents into a Map class. Here no Map class is specified, so an IdentityMapper is used, which just passes content through. In this case it is a url->document mapping because that was the input from the RecordReader in the InputFormat.
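
For concreteness, here is a minimal, self-contained sketch of that kind of content hashing. The class and method names are mine, not Nutch's; the real code has its own hash types, so this only shows what "a hash of the text" amounts to.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative helper, not the actual Nutch code.
public class ContentHash {

    // Returns the MD5 digest of a document's text as a hex string.
    static String md5Hex(String text) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Two identical texts produce the same hash, which is what dedup keys on.
        System.out.println(md5Hex("the text of the indexed document"));
    }
}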

The values passed to the UrlReducer would be all documents with the exact same url. This class keeps only the most recent one and discards the others, storing the output as a hash->document structure. The hash is the MD5 hash of the contents of the web page or document. This is then used as input for a second MapReduce job. Again no Mapper class is specified, so an IdentityMapper just passes content through. The values passed to the HashReducer would be all documents with the exact same content (i.e. hash). It keeps the one with the highest score (created at indexing time) and discards the others. The output from the HashReducer is then used as input for a third MapReduce job that actually deletes from the index files.
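
Here is a hedged, plain-Java sketch of that selection logic. The Doc class and its field names are simplified stand-ins for Nutch's IndexDoc, not the real API; the point is only the keep/discard decision made in each phase.

import java.util.List;

// Simplified stand-in for IndexDoc (field names are illustrative).
class Doc {
    String url;
    String hash;     // MD5 of the page content
    float score;     // indexing-time score
    long fetchTime;  // when the page was fetched
}

public class DedupLogic {

    // Phase 1 (the UrlReducer idea): among docs sharing the same url,
    // keep only the most recently fetched one.
    static Doc keepLatest(List<Doc> sameUrl) {
        Doc latest = sameUrl.get(0);
        for (Doc d : sameUrl) {
            if (d.fetchTime > latest.fetchTime) {
                latest = d;
            }
        }
        return latest;
    }

    // Phase 2 (the HashReducer idea): among docs sharing the same content hash,
    // keep only the one with the highest score; the rest get deleted.
    static Doc keepHighestScore(List<Doc> sameHash) {
        Doc best = sameHash.get(0);
        for (Doc d : sameHash) {
            if (d.score > best.score) {
                best = d;
            }
        }
        return best;
    }
}

In the real job the grouping by url and by hash is done by the MapReduce framework itself; each reducer only ever sees one such group of values per call.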

Forget about the compareTo, because it wouldn't be a straight object-to-object comparison. What you would want to do is change the RecordReader to analyze the document in your specific way and add another field to the IndexDoc which is your numeric representation of the similarity comparison. Then in the UrlReducer you would want to collect your numeric as the key, and in the HashReducer make your comparison of which document to keep and which to discard based on the similarity numeric. Remember that similar urls would need to return the same numeric so that they get passed to the HashReducer class as a single set of values.
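
To make the "same numeric for similar documents" requirement concrete, here is one toy way such a key could be derived. The normalization below (lower-casing, stripping punctuation, sorting tokens) is purely illustrative and not a recommendation of any particular similarity measure.

import java.util.Arrays;

// Toy similarity key: near-duplicate texts map to the same long value,
// so they arrive at the reducer as a single set of values.
public class SimilarityKey {

    static long similarityKey(String text) {
        String[] tokens = text.toLowerCase()
                              .replaceAll("[^a-z\\s]", " ")
                              .split("\\s+");
        Arrays.sort(tokens);
        long h = 0;
        for (String t : tokens) {
            if (t.length() > 3) {          // ignore very short tokens
                h = 31 * h + t.hashCode(); // order-independent after the sort
            }
        }
        return h;
    }

    public static void main(String[] args) {
        // Pages differing only in punctuation and case get the same key.
        System.out.println(similarityKey("Hello, World! This is a Sample Page."));
        System.out.println(similarityKey("hello world this is a sample page"));
    }
}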

Dennis

sdeck wrote:
That sort of gets me there in understanding what is going on.
Still not all the way though.
So, let's look at the trunk of deleteduplicates:
http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/DeleteDuplicates.java

Nowhere in there do I see where url == url, and if so, delete that doc from
the index.
So, I am not sure where I would put my code.
I could possibly modify the hash content reducer. Basically, here is the
algorithm approach:

Start at document 1.
Loop through 2-N, taking the text of 1 and comparing it to the text of 2, 3, 4, ... N.
If the similarity score is > ## then delete that document.
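
In rough Java, the loop I have in mind would look something like this (only a sketch; the similarity() function and the threshold are placeholders for whatever measure ends up being used):

import java.util.ArrayList;
import java.util.List;

public class PairwiseDedup {

    // Placeholder: plug in the real similarity measure here.
    static double similarity(String a, String b) {
        return a.equals(b) ? 1.0 : 0.0;
    }

    // Keeps doc 1 and drops any later doc whose text is too similar to it.
    static List<String> dropNearDuplicatesOfFirst(List<String> texts, double threshold) {
        List<String> kept = new ArrayList<String>();
        kept.add(texts.get(0));
        for (int i = 1; i < texts.size(); i++) {
            if (similarity(texts.get(0), texts.get(i)) <= threshold) {
                kept.add(texts.get(i));
            }
        }
        return kept;
    }
}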

The way I understand the hash reducer, that is what it is doing, but I don't
really understand where the score is coming from and where the comparison is
really taking place.
The score is the calculated score of the indexed document. This score is partially created at the time the page was indexed.
I see this:
public int compareTo(Object o) {
  IndexDoc that = (IndexDoc)o;
  if (this.keep != that.keep) {
    return this.keep ? 1 : -1;
  } else if (!this.hash.equals(that.hash)) {   // order first by hash
    return this.hash.compareTo(that.hash);
  ...


So, is that where I would place my similarity score and return that value
there?




Dennis Kubes wrote:
If I am understanding what you are asking, the getRecordReader method of the InputFormat inner class in DeleteDuplicates gets the hash score from the document. You could put your algorithm there and return some type of numeric value based on analysis of the document fields. You would need to write a different class for HashScore and return it from the record reader. You would probably want to keep the IndexDoc being written out as the value in dedup phase 1 (in the job config) but change the key to your HashScore replacement class. You would also need to change HashPartitioner to partition according to your new key numeric, and HashReducer to collect only the ones you want based on your new key numeric. The dedup phase 2 deletes by url, so if you want to remove exact urls you would leave it in; otherwise you might want to take the job config section for phase 2 out.
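
For illustration only, here is a rough sketch of that replacement-key idea, deliberately not tied to any particular Hadoop API version. The class name and methods are made up for the example; in the real code the key would also have to implement the framework's serialization interfaces.

// Hypothetical key carrying the numeric produced by analyzing the document fields.
public class SimilarityScoreKey implements Comparable<SimilarityScoreKey> {

    private final long score;

    public SimilarityScoreKey(long score) {
        this.score = score;
    }

    public long getScore() {
        return score;
    }

    // Ordering by the numeric, which is what groups equal keys into one reduce call.
    public int compareTo(SimilarityScoreKey other) {
        return Long.compare(this.score, other.score);
    }

    // What a custom partitioner would compute: equal keys go to the same reducer.
    public int partitionFor(int numReduceTasks) {
        return (int) ((score & Long.MAX_VALUE) % numReduceTasks);
    }
}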

Hope this helps.

Dennis

sdeck wrote:
Hello,
  I am running Nutch 0.8 against Hadoop 0.4, just for reference.
I want to add a delete duplicate based on a similarity algorithm, as
opposed
to the hash method that is currently in there.
I would have to say I am pretty lost as to how the delete duplicates
class
is working.
I would guess that I need to implement a compareTo method, but I am not
really sure what to return. Also, when I do return something, where do I
implement the functionality to say "yes, these are dupes, so remove the
first one"?

Can anyone help out?
Thanks,
S
