sdeck wrote:
That sort of gets me closer to understanding what is going on, but still not all the way.
So, let's look at the trunk of deleteduplicates:
http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/DeleteDuplicates.java
Nowhere in there do I see a check for url == url and, if so, that doc getting deleted from
the index.
So, I am not sure where I would put my code.
I could possibly modify the hash content reducer. Basically, here is the
algorithm I have in mind:
1. Start at document 1.
2. Loop through 2-N, comparing the text of document 1 to the text of 2, 3, 4, ..., N.
3. If the similarity score is > ## then delete that document.
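A rough sketch of that loop, just to pin the idea down (the similarity function here is a crude token-overlap placeholder rather than anything Nutch provides, the threshold is whatever ## ends up being, and the outer loop just repeats the same comparison starting from each document that survives):

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NearDuplicateSketch {

  // Crude token-set Jaccard similarity; a placeholder for whatever
  // scoring you actually want to use.
  static double similarity(String a, String b) {
    Set<String> ta = new HashSet<String>(Arrays.asList(a.toLowerCase().split("\\s+")));
    Set<String> tb = new HashSet<String>(Arrays.asList(b.toLowerCase().split("\\s+")));
    Set<String> inter = new HashSet<String>(ta);
    inter.retainAll(tb);
    Set<String> union = new HashSet<String>(ta);
    union.addAll(tb);
    return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
  }

  // Marks every later document whose text is too similar to an earlier,
  // kept document. Note this is O(N^2) over the whole collection, which
  // is exactly what makes it awkward as a single MapReduce pass.
  static boolean[] markDeletions(List<String> texts, double threshold) {
    boolean[] delete = new boolean[texts.size()];
    for (int i = 0; i < texts.size(); i++) {
      if (delete[i]) continue;
      for (int j = i + 1; j < texts.size(); j++) {
        if (!delete[j] && similarity(texts.get(i), texts.get(j)) > threshold) {
          delete[j] = true;
        }
      }
    }
    return delete;
  }
}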
The way I understand the hash reducer, that is roughly what it is doing, but I don't
really understand where the score comes from or where the comparison actually
takes place.
I see this:
public int compareTo(Object o) {
  IndexDoc that = (IndexDoc)o;
  if (this.keep != that.keep) {
    return this.keep ? 1 : -1;
  } else if (!this.hash.equals(that.hash)) {  // order first by hash
    return this.hash.compareTo(that.hash);
  ...
So, is that where I would place my similarity score and return that value?
AFAIK DeleteDuplicates works like this:
IndexDoc is a representation of the actual document in your index (IndexDoc
keeps, among other things, the document's url, boost and digest). It is also
Writable and Comparable, which means that it can be used both as a key
and a value in MapReduce.
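For reference, here is a simplified skeleton of such a record type. This is not the real IndexDoc (see the source linked above); it only illustrates the Writable/Comparable contract, with fields taken from the description plus a fetch time and the keep flag from the snippet quoted earlier, which the phases below rely on:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.MD5Hash;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class DocRecord implements WritableComparable {
  Text url = new Text();
  MD5Hash hash = new MD5Hash();   // content digest
  float boost;
  long time;                      // fetch time, used to keep the newest copy
  boolean keep = true;            // flipped to false when marked for deletion

  public void write(DataOutput out) throws IOException {
    url.write(out);
    hash.write(out);
    out.writeFloat(boost);
    out.writeLong(time);
    out.writeBoolean(keep);
  }

  public void readFields(DataInput in) throws IOException {
    url.readFields(in);
    hash.readFields(in);
    boost = in.readFloat();
    time = in.readLong();
    keep = in.readBoolean();
  }

  public int compareTo(Object o) {
    DocRecord that = (DocRecord) o;
    // illustrative ordering only: docs to keep sort after docs to delete,
    // then by hash, as in the compareTo snippet quoted above
    if (this.keep != that.keep) {
      return this.keep ? 1 : -1;
    }
    return this.hash.compareTo(that.hash);
  }
}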
In the first phase of dedup, the job reads the indexes and outputs
<IndexDoc.url, IndexDoc> pairs. The job's map is the identity, so in reduce,
IndexDocs with the same url are grouped under the same reduce call. Reduce
outputs these, marking older versions of the same url for deletion. (So if you
fetched the same url more than once, only the newest is kept.)
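Roughly, the marking done in that reduce looks like this (Hadoop plumbing stripped out, using the hypothetical DocRecord sketched above; in the real job this happens inside the reduce call that receives all records for one url):

import java.util.List;

public class UrlDedupSketch {
  // Phase-1 idea: given all records that share one url, keep only the most
  // recently fetched copy and mark every older copy for deletion.
  static void markOldVersions(List<DocRecord> sameUrl) {
    DocRecord newest = null;
    for (DocRecord doc : sameUrl) {
      if (newest == null || doc.time > newest.time) {
        newest = doc;
      }
    }
    for (DocRecord doc : sameUrl) {
      doc.keep = (doc == newest);
    }
  }
}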
In phase 2, the job reads this output and then outputs <IndexDoc.hash, IndexDoc>
pairs. Again the map is the identity, and reduce marks the relevant ones for
deletion. (So if you fetched the same document under different urls, only
the one with the highest boost or the shortest url is kept.)
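And the phase-2 marking, with the same caveats as the sketch above:

import java.util.List;

public class HashDedupSketch {
  // Phase-2 idea: given all records that share one content hash, keep the
  // copy with the highest boost (shortest url as tie-breaker) and mark the
  // rest. Records already marked in phase 1 are skipped.
  static void markDuplicateContent(List<DocRecord> sameHash) {
    DocRecord best = null;
    for (DocRecord doc : sameHash) {
      if (!doc.keep) continue;
      if (best == null
          || doc.boost > best.boost
          || (doc.boost == best.boost
              && doc.url.getLength() < best.url.getLength())) {
        best = doc;
      }
    }
    for (DocRecord doc : sameHash) {
      if (doc.keep) {
        doc.keep = (doc == best);
      }
    }
  }
}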
Phase 3 reads this output and then deletes all the marked documents from the indexes.
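The deletion itself is ordinary Lucene, along these lines (a loose sketch using the old IndexReader delete API; indexPath and docIds are placeholders, not names from the Nutch code):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;

public class DeletePhaseSketch {
  // Phase-3 idea: open the index and delete the Lucene document ids that
  // the earlier phases marked.
  static void deleteMarked(String indexPath, int[] docIds) throws IOException {
    IndexReader reader = IndexReader.open(indexPath);
    try {
      for (int id : docIds) {
        reader.deleteDocument(id);
      }
    } finally {
      reader.close();
    }
  }
}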
I think that your version will be somewhat difficult to implement, because
MapReduce works best on input records that can be processed independently
of each other.
Hope that clears things up a bit.
--
Dogacan Guney