Some of it is happening behind the scenes. A hash of the text of the indexed document is created when the documents are read. The MapReduce process uses the InputFormat static inner class to read documents into a Map class. Here no Map class is specified, so an IdentityMapper is used, which just passes content through. In this case it is a url->document mapping because that was the input from the RecordReader in the InputFormat.
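
For concreteness, here is a minimal, self-contained sketch of that kind of content hashing. The class and method names are mine, not Nutch's; the real code has its own hash types, so this only shows what "a hash of the text" amounts to.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative helper, not the actual Nutch code.
public class ContentHash {

    // Returns the MD5 digest of a document's text as a hex string.
    static String md5Hex(String text) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Two identical texts produce the same hash, which is what dedup keys on.
        System.out.println(md5Hex("the text of the indexed document"));
    }
}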

The values passed to the UrlReducer would be all documents with the exact same url. This class keeps only the most recent one and discards the others, storing the output as a hash->document structure. The hash is the MD5 hash of the contents of the web page or document. This is then used as input for a second MapReduce job. Again no Mapper class is specified, so an IdentityMapper just passes content through. The values passed to the HashReducer would be all documents with the exact same content (i.e. hash). It keeps the one with the highest score (created at indexing time) and discards the others. The output from the HashReducer is then used as input for a third MapReduce job that actually deletes from the index files.
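
Here is a hedged, plain-Java sketch of that selection logic. The Doc class and its field names are simplified stand-ins for Nutch's IndexDoc, not the real API; the point is only the keep/discard decision made in each phase.

import java.util.List;

// Simplified stand-in for IndexDoc (field names are illustrative).
class Doc {
    String url;
    String hash;     // MD5 of the page content
    float score;     // indexing-time score
    long fetchTime;  // when the page was fetched
}

public class DedupLogic {

    // Phase 1 (the UrlReducer idea): among docs sharing the same url,
    // keep only the most recently fetched one.
    static Doc keepLatest(List<Doc> sameUrl) {
        Doc latest = sameUrl.get(0);
        for (Doc d : sameUrl) {
            if (d.fetchTime > latest.fetchTime) {
                latest = d;
            }
        }
        return latest;
    }

    // Phase 2 (the HashReducer idea): among docs sharing the same content hash,
    // keep only the one with the highest score; the rest get deleted.
    static Doc keepHighestScore(List<Doc> sameHash) {
        Doc best = sameHash.get(0);
        for (Doc d : sameHash) {
            if (d.score > best.score) {
                best = d;
            }
        }
        return best;
    }
}

In the real job the grouping by url and by hash is done by the MapReduce framework itself; each reducer only ever sees one such group of values per call.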

Forget about the compareTo, because it wouldn't be a straight object-to-object comparison. What you would want to do is change the RecordReader to analyze the document in your specific way and add another field to the IndexDoc which is your numeric representation of the similarity comparison. Then in the UrlReducer you would want to collect your numeric as the key, and in the HashReducer make your comparison of which document to keep and which to discard based on the similarity numeric. Remember that similar urls would need to return the same numeric so that they get passed to the HashReducer class as a single set of values.
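
To make the "same numeric for similar documents" requirement concrete, here is one toy way such a key could be derived. The normalization below (lower-casing, stripping punctuation, sorting tokens) is purely illustrative and not a recommendation of any particular similarity measure.

import java.util.Arrays;

// Toy similarity key: near-duplicate texts map to the same long value,
// so they arrive at the reducer as a single set of values.
public class SimilarityKey {

    static long similarityKey(String text) {
        String[] tokens = text.toLowerCase()
                              .replaceAll("[^a-z\\s]", " ")
                              .split("\\s+");
        Arrays.sort(tokens);
        long h = 0;
        for (String t : tokens) {
            if (t.length() > 3) {          // ignore very short tokens
                h = 31 * h + t.hashCode(); // order-independent after the sort
            }
        }
        return h;
    }

    public static void main(String[] args) {
        // Pages differing only in punctuation and case get the same key.
        System.out.println(similarityKey("Hello, World! This is a Sample Page."));
        System.out.println(similarityKey("hello world this is a sample page"));
    }
}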

Dennis

sdeck wrote:
That sort of gets me there in understanding what is going on.
Still not all the way though.
So, let's look at the trunk of deleteduplicates:
http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/DeleteDuplicates.java

Nowhere in there do I see where url == url, and if so, delete that doc from
the index.
So, I am not sure where I would put my code.
I could possibly modify the hash content reducer. Basically, here is the
algorithm approach:

Start at document 1.
Loop through 2-N, taking the text of 1 and comparing it to the text of 2, 3, 4, ... N.
If the similarity score is > ## then delete that document.
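
In rough Java, the loop I have in mind would look something like this (only a sketch; the similarity() function and the threshold are placeholders for whatever measure ends up being used):

import java.util.ArrayList;
import java.util.List;

public class PairwiseDedup {

    // Placeholder: plug in the real similarity measure here.
    static double similarity(String a, String b) {
        return a.equals(b) ? 1.0 : 0.0;
    }

    // Keeps doc 1 and drops any later doc whose text is too similar to it.
    static List<String> dropNearDuplicatesOfFirst(List<String> texts, double threshold) {
        List<String> kept = new ArrayList<String>();
        kept.add(texts.get(0));
        for (int i = 1; i < texts.size(); i++) {
            if (similarity(texts.get(0), texts.get(i)) <= threshold) {
                kept.add(texts.get(i));
            }
        }
        return kept;
    }
}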

The way I understand the hash reducer, that is what it is doing, but I don't
really understand where the score is coming from and where the comparison is
really taking place.
The score is the calculated score of the indexed document. This score is partially created at the time the page was indexed.
I see this:
public int compareTo(Object o) {
  IndexDoc that = (IndexDoc)o;
  if (this.keep != that.keep) {
    return this.keep ? 1 : -1;
  } else if (!this.hash.equals(that.hash)) {   // order first by hash
    return this.hash.compareTo(that.hash);
  ...


So, is that where I would place my similarity score and return that value
there?




Dennis Kubes wrote:
If I am understanding what you are asking, the getRecordReader method of the InputFormat inner class in DeleteDuplicates gets the hash score from the document. You could put your algorithm there and return some type of numeric value based on analysis of the document fields. You would need to write a different class for HashScore and return it from the record reader. You would probably want to keep the IndexDoc being written out as the value in dedup phase 1 (in the job config) but change the key to your HashScore replacement class. You would also need to change HashPartitioner to partition according to your new key numeric, and HashReducer to collect only the ones you want based on your new key numeric. The dedup phase 2 deletes by url, so if you want to remove exact urls you would leave it in; otherwise you might want to take the job config section for phase 2 out.
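
For illustration only, here is a rough sketch of that replacement-key idea, deliberately not tied to any particular Hadoop API version. The class name and methods are made up for the example; in the real code the key would also have to implement the framework's serialization interfaces.

// Hypothetical key carrying the numeric produced by analyzing the document fields.
public class SimilarityScoreKey implements Comparable<SimilarityScoreKey> {

    private final long score;

    public SimilarityScoreKey(long score) {
        this.score = score;
    }

    public long getScore() {
        return score;
    }

    // Ordering by the numeric, which is what groups equal keys into one reduce call.
    public int compareTo(SimilarityScoreKey other) {
        return Long.compare(this.score, other.score);
    }

    // What a custom partitioner would compute: equal keys go to the same reducer.
    public int partitionFor(int numReduceTasks) {
        return (int) ((score & Long.MAX_VALUE) % numReduceTasks);
    }
}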

Hope this helps.

Dennis

sdeck wrote:
Hello,
  I am running Nutch 0.8 against Hadoop 0.4, just for reference.
I want to add a delete duplicate based on a similarity algorithm, as
opposed
to the hash method that is currently in there.
I would have to say I am pretty lost as to how the delete duplicates
class
is working.
I would guess that I need to implement a compareTo method, but I am not
really sure what to return. Also, when I do return something, where do I
implement the functionality to say "yes, these are dupes, so remove the
first one"?

Can anyone help out?
Thanks,
S
