Thanks very much, Henok, for your reply. I would be very interested in your thesis and in any code that you can provide. Is your thesis published online? Is it in English? Your approach sounds very interesting, and I would like to look at the details. Some ideas I had were to use cosine similarity, or perhaps n-grams, since the quotations that I am comparing are much shorter than an entire document or article.
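To make the cosine similarity and n-gram ideas concrete, here is a minimal sketch in plain Java (no Lucene dependency). It counts word n-grams in each quotation and takes the cosine of the two count vectors. The class name QuoteSimilarity, the method names, and the normalization rule (lower-casing and stripping punctuation) are my own assumptions for illustration; none of this comes from Henok's thesis or from the Lucene API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: cosine similarity over word n-gram counts.
final class QuoteSimilarity {

    // Count word n-grams in lower-cased text with punctuation removed.
    static Map<String, Integer> ngramCounts(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        String normalized = text.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{Nd}]+", " ")
                .trim();
        Map<String, Integer> counts = new HashMap<>();
        if (normalized.isEmpty()) {
            return counts;
        }
        String[] words = normalized.split(" ");
        for (int i = 0; i + n <= words.length; i++) {
            String gram = String.join(" ", Arrays.copyOfRange(words, i, i + n));
            counts.merge(gram, 1, Integer::sum);
        }
        return counts;
    }

    // Cosine of the angle between two sparse count vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // one side has no n-grams, so treat as dissimilar
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Convenience wrapper: similarity of two quotations in [0, 1].
    static double similarity(String first, String second, int n) {
        return cosine(ngramCounts(first, n), ngramCounts(second, n));
    }

    public static void main(String[] args) {
        System.out.println(similarity("the quick brown fox", "the quick red fox", 1));
        System.out.println(similarity("the quick brown fox", "the quick red fox", 2));
    }
}
```

With n = 1 this is ordinary cosine similarity over word counts; a larger n is stricter about word order. The cut-off above which two quotations count as the same would have to be tuned against the actual translations.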
I would be very happy to hear from you. My email address is fayyazud...@gmail.com. Also, were you able to implement your solution online? If you can provide me with a URL, that would be great. I look forward to hearing from you soon.

Sincerely,
Fayyaz

henok sahilu wrote:
>
> I have done a thesis on this. It was on legal documents. XML IR systems
> are very susceptible to producing duplicate or near-duplicate content
> (duplicate in textual content, not in concept).
> Here is what I did.
> I tagged each article content in the legal documents with its status and
> its relationship to other article contents.
> Then I wrote a parser that reads these tags and indexes the contents
> therein.
> I also wrote a re-ranking algorithm that works from the status
> information of the article contents and their relationships.
> For example, content that contains an active law is boosted, because it
> is the active law that prevails.
> Sometimes articles are replaced by other articles. In this case, rather
> than presenting both of them (which can result in duplicates), I compare
> the new terms introduced by the replacement with the query terms and
> boost the replacing article content, or I compare the terms that are
> used exclusively by the old article content with the query terms and
> boost the old article content.
> Repealed article contents are down-weighted by some factor rather than
> presented alongside their other version.
> This is what I did to reduce the number of duplicate or near-duplicate
> search results shown to users. Otherwise the user can waste considerable
> time inspecting which results are duplicates and which are not.
> I can give you the abstract and some code (in Java) if you want.
> I think this might help.
> henok
>
> --- On Wed, 9/16/09, syedfa <fayyazud...@gmail.com> wrote:
>
> From: syedfa <fayyazud...@gmail.com>
> Subject: Finding duplicate records from a result set
> To: java-user@lucene.apache.org
> Date: Wednesday, September 16, 2009, 1:52 AM
>
>
> Dear Fellow Java/Lucene developers:
>
> One annoyance that people have when searching for information online is
> the occurrence of duplicate records (i.e. multiple sites that carry news
> feeds from the SAME news source, like Reuters or the Associated Press,
> and do not provide any additional information). This becomes an issue
> for the user, who would like to sift out all the duplicates and look
> through only the unique hits. In the application that I am working on,
> this is extremely common. I have various books in XML format that
> contain quotations, many of which are also listed in other collections
> (i.e. the narrator and the text of the quotation are almost exactly the
> same; the quotations are reported by multiple sources). Because the
> books have been translated into English by different authors, the
> quotations from each collection differ slightly from one another.
> What I would like to do in my application, using Lucene, is to return a
> set of results such that, if a user searches for a particular keyword
> (or for multiple keywords), any quote that is reported by multiple
> sources is listed only once, with all the references from the other
> collections where it is found listed underneath it, instead of the same
> quote appearing in the result set multiple times.
>
> For example, if John Doe said, "blah blah blah", which is found in
> sources A, B, and C, and a user searched for "blah blah blah", then I
> want the result set to show:
>
> 1. Narrator: John Doe
>    Quote: "blah blah blah"
>    Reference: A, B, C
>
> and NOT the following:
>
> 1. Narrator: John Doe
>    Quote: "blah blah blah"
>    Reference: A
>
> 2. Narrator: John Doe
>    Quote: "blah blah blah"
>    Reference: B
>
> 3. Narrator: John Doe
>    Quote: "blah blah blah"
>    Reference: C
>
> I would imagine that this is a known issue in information retrieval, and
> I am wondering whether you have been able to solve or address it in Java
> using Lucene. What would you advise?
>
> Thanks to everyone for your time and patience.
> --
> View this message in context:
> http://www.nabble.com/Finding-duplicate-records-from-a-result-set-tp25468423p25468423.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

--
View this message in context: http://www.nabble.com/Finding-duplicate-records-from-a-result-set-tp25468423p25469429.html
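The result layout requested in the original question (one entry per quote, with references A, B, C merged underneath) can be sketched as a post-processing step over the hits, independent of how they were retrieved. This is a rough sketch only and not a Lucene feature: QuoteGrouper, Hit, Group, and the 0.8 threshold are hypothetical names and values, and the word-set Jaccard overlap is just one simple choice of near-duplicate test.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical post-processor: merges near-duplicate hits into one entry.
final class QuoteGrouper {

    // One search hit; the field names are assumptions, not Lucene types.
    static final class Hit {
        final String narrator;
        final String quote;
        final String reference;

        Hit(String narrator, String quote, String reference) {
            this.narrator = narrator;
            this.quote = quote;
            this.reference = reference;
        }
    }

    // One displayed result: the first-seen quote plus all its references.
    static final class Group {
        final String narrator;
        final String quote;
        final Set<String> tokens;
        final List<String> references = new ArrayList<>();

        Group(String narrator, String quote, Set<String> tokens) {
            this.narrator = narrator;
            this.quote = quote;
            this.tokens = tokens;
        }
    }

    // Lower-case the text, drop punctuation, and return the set of words.
    static Set<String> tokens(String text) {
        String norm = text.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{Nd}]+", " ")
                .trim();
        return norm.isEmpty()
                ? new HashSet<>()
                : new HashSet<>(Arrays.asList(norm.split(" ")));
    }

    // Jaccard overlap of two word sets: |A and B| / |A or B|.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        int union = a.size() + b.size() - shared.size();
        return (double) shared.size() / union;
    }

    // Greedy pass in rank order: each hit joins the first matching group.
    static List<Group> group(List<Hit> hits, double threshold) {
        List<Group> groups = new ArrayList<>();
        for (Hit hit : hits) {
            Set<String> hitTokens = tokens(hit.quote);
            Group match = null;
            for (Group g : groups) {
                if (g.narrator.equalsIgnoreCase(hit.narrator)
                        && jaccard(g.tokens, hitTokens) >= threshold) {
                    match = g;
                    break;
                }
            }
            if (match == null) {
                match = new Group(hit.narrator, hit.quote, hitTokens);
                groups.add(match);
            }
            match.references.add(hit.reference);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit("John Doe", "blah blah blah", "A"),
                new Hit("John Doe", "Blah, blah blah.", "B"),
                new Hit("John Doe", "blah blah blah", "C"));
        int rank = 1;
        for (Group g : group(hits, 0.8)) {
            System.out.println(rank++ + ". Narrator: " + g.narrator);
            System.out.println("   Quote: \"" + g.quote + "\"");
            System.out.println("   Reference: " + String.join(", ", g.references));
        }
    }
}
```

Because the pass runs in rank order, each merged entry keeps the position of its best-ranked member. The comparison is quadratic in the number of hits, which is probably acceptable for a single page of results but would need a different approach for a whole index.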