John Casey wrote:
> On 10/18/06, Isabel Drost <[EMAIL PROTECTED]> wrote:
>>
>> Find Me wrote:
>> > How to eliminate near duplicates from the index? Someone suggested
>> > that I could look at the TermVectors and do a comparison to remove
>> > the duplicates.
>>
>> As an alternative you could also have a look at the paper "Detecting
>> Phrase-Level Duplication on the World Wide Web" by Dennis Fetterly,
>> Mark Manasse, Marc Najork.
>
> Another good reference would be Soumen Chakrabarti's reference book,
> "Mining the Web - Discovering Knowledge from Hypertext Data", 2003 and
> the section on shingling and the elimination of near duplicates. Of
> course I think this works at the document level rather than at the term
> vector level but it might be useful to prevent duplicate documents from
> being indexed altogether.
>
>> > One major problem with this is the structure of the document is
>> > no longer important. Are there any obvious pitfalls? For example:
>> > Document A being a subset of Document B but in no particular order.
>>
>> I think this case is pretty unlikely. But I am not sure whether you can
>> detect near duplicates by only comparing term-document vectors. There
>> might be problems with documents with slightly changed words, words
>> that were replaced with synonyms...
>>
>> However, if you want to keep some information on the word order, you
>> might consider comparing n-gram document vectors. That is, each
>> dimension in the vector does not only represent one word but a
>> sequence of 2, 3, 4, 5... words.
>
> would this involve something like a window of 2-5 words around a
> particular term in a document?
>
>> Cheers,
>> Isabel
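On the window question: as I understand the n-gram (shingling) idea, the window is not centred on a particular term. It slides over the whole text, so every run of n consecutive words becomes one feature. Here is a rough Python sketch of that, only to illustrate the idea. It is not Nutch or Lucene code, and the names `shingles` and `jaccard`, the tokenisation, and n=3 are my own assumptions.

```python
import re

def shingles(text, n=3):
    """Return the set of contiguous word n-grams (shingles) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        # Text shorter than n words: use the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox leaps over the lazy dog"  # one word replaced
doc_c = "dog lazy the over jumps fox brown quick the"  # same words, reversed

sim_ab = jaccard(shingles(doc_a), shingles(doc_b))
sim_ac = jaccard(shingles(doc_a), shingles(doc_c))
print(sim_ab, sim_ac)
```

doc_a and doc_c contain exactly the same words, so a plain term vector cannot tell them apart, but they share no 3-word shingle. doc_a and doc_b differ in one word and still share the shingles that do not touch it. Which threshold counts as a "near duplicate" is something you would have to tune on your own data.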
DeleteDuplicates removes documents that have the same digest or the same URL. If you use TextProfileSignature instead of MD5Signature, it will also remove near-duplicate documents. The MD5Signature class sets the digest to the MD5 of all the content, whereas TextProfileSignature sets the digest to the MD5 of the significant terms. You should check the class for implementation details. Also look at SignatureFactory for how to change the configuration.
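To show why a digest over significant terms catches near duplicates while a digest over the full content does not, here is a simplified Python sketch. It is not the Nutch implementation: the function names, the tokenisation, and the parameters `min_token_len` and `quant_rate` are assumptions for illustration, so check the class itself for what it really does.

```python
import hashlib
import re
from collections import Counter

def full_digest(text):
    """MD5 over the raw content: any change at all gives a new digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def profile_digest(text, min_token_len=2, quant_rate=0.01):
    """MD5 over a quantized term-frequency profile (illustrative only)."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()
    max_freq = max(counts.values())
    # Quantization step: frequencies are rounded down to a multiple of it.
    quant = round(max_freq * quant_rate)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1
    profile = []
    for term, freq in counts.items():
        quantized = (freq // quant) * quant
        if quantized >= quant:  # drop terms too rare to be significant
            profile.append((term, quantized))
    # Stable order: most frequent first, ties broken alphabetically.
    profile.sort(key=lambda p: (-p[1], p[0]))
    serialized = "\n".join(f"{term} {freq}" for term, freq in profile)
    return hashlib.md5(serialized.encode("utf-8")).hexdigest()

page_a = ("Nutch crawls the web. Nutch indexes pages. "
          "The web is large and pages change.")
page_b = ("Nutch crawls the web! Nutch indexes pages. "
          "The web is huge and pages change often.")

print(full_digest(page_a) == full_digest(page_b))        # False
print(profile_digest(page_a) == profile_digest(page_b))  # True
```

The two pages differ only in punctuation and a few words that occur once, so their profiles (the repeated terms and their rounded counts) are identical and the profile digests match, while the full-content digests differ. The trade-off is that a coarser profile will also merge pages that are not really duplicates.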
