qi wu wrote:
> Hi,
> I found many pages with the same title, page contents are almost same. I
> would like to index the pages with the same title only once. How can I
> recognize the pages with same title during indexing process?
> How do nutch remove pages with same page content and in which class/package
> can I find the code?
>
> Thanks
> -Qi

Hi,
Normally, in the Nutch processing sequence, you can run the dedup command after indexing to delete duplicate entries from the index. The DeleteDuplicates class does this in two phases: in the first phase, documents with the same URL are deleted; in the second, documents with the same content are deleted. In your case, I assume the document URLs are different but the contents are "nearly the same".

Whether two documents count as having the same content is decided by comparing their signatures, which are computed with either MD5Signature or TextProfileSignature. MD5Signature computes a hash of the page content, so if two pages' contents are not exactly the same, it will generate distinct signatures. TextProfileSignature instead generates a signature based on the most frequent terms of the content, so pages with similar content tend to generate the same signature.

I can recommend two options. The first is to use TextProfileSignature (you can change the signature class in the configuration); the other is to modify the DeleteDuplicates code to delete duplicates by title. IMO the former method is more sensible.
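
To illustrate why the two signature types behave differently, here is a rough, self-contained sketch. It is not Nutch's actual code: the class SignatureSketch, its method names, the 3-character minimum token length, and the topN cutoff are all made up for the example, and the real TextProfileSignature does more work than this. It only shows the idea that hashing the exact content separates near-duplicates, while hashing a profile of the most frequent terms can merge them. (In the Nutch versions I know, the signature class is selected with the db.signature.class property; please check your nutch-default.xml to confirm.)

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not the Nutch implementation.
class SignatureSketch {

    // Hex-encoded MD5 digest of a string.
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Exact-content signature: any difference in the text changes the result.
    static String exactSignature(String content) {
        return md5Hex(content);
    }

    // Profile signature: hash only the topN most frequent terms.
    static String profileSignature(String content, int topN) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String tok : content.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (tok.length() < 3) {
                continue; // assumed cutoff: ignore very short tokens
            }
            counts.merge(tok, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Most frequent first; ties broken alphabetically so the order is stable.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        StringBuilder profile = new StringBuilder();
        for (int i = 0; i < Math.min(topN, entries.size()); i++) {
            profile.append(entries.get(i).getKey()).append(' ');
        }
        return md5Hex(profile.toString());
    }

    public static void main(String[] args) {
        String a = "Nutch crawler indexes pages. Nutch crawler removes duplicate pages.";
        String b = a + " Visitor 42";
        System.out.println("exact signatures equal:   "
                + exactSignature(a).equals(exactSignature(b)));
        System.out.println("profile signatures equal: "
                + profileSignature(a, 3).equals(profileSignature(b, 3)));
    }
}
```

The two sample pages differ only by a short trailing string, which is the kind of difference (counters, dates, footers) that often makes otherwise identical pages hash differently under an exact-content signature.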
