sdeck wrote:
That sort of gets me closer to understanding what is going on, but still not all the way.
So, let's look at the trunk of deleteduplicates:
http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/DeleteDuplicates.java
Nowhere in there do I see a check for url == url and, if so, that doc getting deleted from
the index.
So, I am not sure where I would put my code.
I could possibly modify the hash content reducer. Basically, here is the
algorithm I have in mind:
1. Start at document 1.
2. Loop through 2-N, comparing the text of document 1 to the text of 2, 3, 4, ..., N.
3. If the similarity score is > ## then delete that document.
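A rough sketch of that loop, just to pin the idea down (the similarity function here is a crude token-overlap placeholder rather than anything Nutch provides, the threshold is whatever ## ends up being, and the outer loop just repeats the same comparison starting from each document that survives):

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NearDuplicateSketch {

  // Crude token-set Jaccard similarity; a placeholder for whatever
  // scoring you actually want to use.
  static double similarity(String a, String b) {
    Set<String> ta = new HashSet<String>(Arrays.asList(a.toLowerCase().split("\\s+")));
    Set<String> tb = new HashSet<String>(Arrays.asList(b.toLowerCase().split("\\s+")));
    Set<String> inter = new HashSet<String>(ta);
    inter.retainAll(tb);
    Set<String> union = new HashSet<String>(ta);
    union.addAll(tb);
    return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
  }

  // Marks every later document whose text is too similar to an earlier,
  // kept document. Note this is O(N^2) over the whole collection, which
  // is exactly what makes it awkward as a single MapReduce pass.
  static boolean[] markDeletions(List<String> texts, double threshold) {
    boolean[] delete = new boolean[texts.size()];
    for (int i = 0; i < texts.size(); i++) {
      if (delete[i]) continue;
      for (int j = i + 1; j < texts.size(); j++) {
        if (!delete[j] && similarity(texts.get(i), texts.get(j)) > threshold) {
          delete[j] = true;
        }
      }
    }
    return delete;
  }
}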
The way I understand the hash reducer, that is roughly what it is doing, but I don't
really understand where the score comes from or where the comparison actually
takes place.
I see this:
public int compareTo(Object o) {
  IndexDoc that = (IndexDoc)o;
  if (this.keep != that.keep) {
    return this.keep ? 1 : -1;
  } else if (!this.hash.equals(that.hash)) {  // order first by hash
    return this.hash.compareTo(that.hash);
  ...
So, is that where I would place my similarity score and return that value?
AFAIK DeleteDuplicates works like this:
IndexDoc is a representation of the actual document in your index (IndexDoc
keeps, among other things, the document's url, boost and digest). It is also
Writable and Comparable, which means that it can be used both as a key
and a value in MapReduce.
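For reference, here is a simplified skeleton of such a record type. This is not the real IndexDoc (see the source linked above); it only illustrates the Writable/Comparable contract, with fields taken from the description plus a fetch time and the keep flag from the snippet quoted earlier, which the phases below rely on:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.MD5Hash;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class DocRecord implements WritableComparable {
  Text url = new Text();
  MD5Hash hash = new MD5Hash();   // content digest
  float boost;
  long time;                      // fetch time, used to keep the newest copy
  boolean keep = true;            // flipped to false when marked for deletion

  public void write(DataOutput out) throws IOException {
    url.write(out);
    hash.write(out);
    out.writeFloat(boost);
    out.writeLong(time);
    out.writeBoolean(keep);
  }

  public void readFields(DataInput in) throws IOException {
    url.readFields(in);
    hash.readFields(in);
    boost = in.readFloat();
    time = in.readLong();
    keep = in.readBoolean();
  }

  public int compareTo(Object o) {
    DocRecord that = (DocRecord) o;
    // illustrative ordering only: docs to keep sort after docs to delete,
    // then by hash, as in the compareTo snippet quoted above
    if (this.keep != that.keep) {
      return this.keep ? 1 : -1;
    }
    return this.hash.compareTo(that.hash);
  }
}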
In the first phase of dedup, the job reads the indexes and outputs
<IndexDoc.url, IndexDoc> pairs. The job's map is the identity, so in reduce,
IndexDocs with the same url are grouped under the same reduce call. Reduce
outputs these, marking older versions of the same url for deletion. (So if you
fetched the same url more than once, only the newest is kept.)
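Roughly, the marking done in that reduce looks like this (Hadoop plumbing stripped out, using the hypothetical DocRecord sketched above; in the real job this happens inside the reduce call that receives all records for one url):

import java.util.List;

public class UrlDedupSketch {
  // Phase-1 idea: given all records that share one url, keep only the most
  // recently fetched copy and mark every older copy for deletion.
  static void markOldVersions(List<DocRecord> sameUrl) {
    DocRecord newest = null;
    for (DocRecord doc : sameUrl) {
      if (newest == null || doc.time > newest.time) {
        newest = doc;
      }
    }
    for (DocRecord doc : sameUrl) {
      doc.keep = (doc == newest);
    }
  }
}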
In phase 2, the job reads this output and then outputs <IndexDoc.hash, IndexDoc>
pairs. Again the map is the identity, and reduce marks the relevant ones for
deletion. (So if you fetched the same document under different urls, only
the one with the highest boost or the shortest url is kept.)
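And the phase-2 marking, with the same caveats as the sketch above:

import java.util.List;

public class HashDedupSketch {
  // Phase-2 idea: given all records that share one content hash, keep the
  // copy with the highest boost (shortest url as tie-breaker) and mark the
  // rest. Records already marked in phase 1 are skipped.
  static void markDuplicateContent(List<DocRecord> sameHash) {
    DocRecord best = null;
    for (DocRecord doc : sameHash) {
      if (!doc.keep) continue;
      if (best == null
          || doc.boost > best.boost
          || (doc.boost == best.boost
              && doc.url.getLength() < best.url.getLength())) {
        best = doc;
      }
    }
    for (DocRecord doc : sameHash) {
      if (doc.keep) {
        doc.keep = (doc == best);
      }
    }
  }
}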
Phase 3 reads this output and then deletes all the marked documents from the indexes.
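The deletion itself is ordinary Lucene, along these lines (a loose sketch using the old IndexReader delete API; indexPath and docIds are placeholders, not names from the Nutch code):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;

public class DeletePhaseSketch {
  // Phase-3 idea: open the index and delete the Lucene document ids that
  // the earlier phases marked.
  static void deleteMarked(String indexPath, int[] docIds) throws IOException {
    IndexReader reader = IndexReader.open(indexPath);
    try {
      for (int id : docIds) {
        reader.deleteDocument(id);
      }
    } finally {
      reader.close();
    }
  }
}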
I think that your version will be somewhat difficult to implement, because
MapReduce works best on input records that can be processed independently
of each other.
Hope that clears things up a bit.
--
Dogacan Guney