qi wu wrote:
> Hi,
> I found many pages with the same title whose contents are almost the
> same. I would like to index pages with the same title only once. How can
> I recognize pages with the same title during the indexing process?
> How does Nutch remove pages with the same content, and in which
> class/package can I find the code?
>
> Thanks
> -Qi
>
Hi,
Normally, in the Nutch processing sequence, you run the dedup command
after indexing to delete duplicate entries from the index. The
DeleteDuplicates class does this in two phases. In the first phase,
documents with the same URL are deleted; in the second, documents with
the same content are deleted. In your case, I assume that the document
URLs are different but the contents are "nearly the same".
Whether two documents count as having the same content is decided by
comparing signatures, computed using either MD5Signature or
TextProfileSignature. MD5Signature computes a value based on the content
of the page, so if two pages' contents are not exactly the same, it
generates distinct signatures. TextProfileSignature, however, generates
a signature based on the most frequent terms of the content, so pages
with similar content will usually generate the same signature.
I can recommend two options. The first is to use TextProfileSignature
(you can change the signature class in the configuration, via the
db.signature.class property); the other is to modify the
DeleteDuplicates code to delete duplicates by title. IMO the former
method is more sensible.
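
To make the difference between the two kinds of signature concrete, here
is a rough standalone sketch. It is not the Nutch code: the class and
method names (SignatureSketch, exactSignature, profileSignature) and the
minCount cutoff are made up for illustration, and the real
TextProfileSignature is more elaborate. It only shows why a hash of the
full text changes with any small edit, while a hash of the frequent
terms can stay the same.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Nutch.
public class SignatureSketch {

    // Exact signature: MD5 over the whole text, so any change alters it.
    static String exactSignature(String text) {
        return md5Hex(text);
    }

    // Profile signature (simplified assumption): MD5 over the sorted list
    // of terms that occur at least minCount times in the text.
    static String profileSignature(String text, int minCount) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.length() > 2) {          // skip empty and very short tokens
                counts.merge(token, 1, Integer::sum);
            }
        }
        List<String> frequent = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minCount) {
                frequent.add(e.getKey());
            }
        }
        Collections.sort(frequent);            // make the result order-independent
        return md5Hex(String.join(" ", frequent));
    }

    // Hex-encoded MD5 digest of a UTF-8 string.
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String a = "Nutch crawler guide. Nutch indexes pages. "
                 + "The crawler fetches pages. Updated Monday.";
        String b = "Nutch crawler guide. Nutch indexes pages. "
                 + "The crawler fetches pages. Updated Friday.";
        System.out.println("exact equal:   "
                + exactSignature(a).equals(exactSignature(b)));
        System.out.println("profile equal: "
                + profileSignature(a, 2).equals(profileSignature(b, 2)));
    }
}
```

The two sample pages differ only in a word that occurs once, so it falls
below the cutoff and does not affect the profile. The same idea is why a
profile-based signature should catch your "nearly the same" pages, at
the cost of occasionally merging pages that are only similar.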
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general