Arber,

I don't have any links to papers handy, unfortunately. Quite honestly, there is a TON of research on this subject. My recommendation is to dig around ACM, where you can find many papers related to duplicate detection. If you don't have an ACM membership for the archives, digging around Google should still yield some results.

Generally, an online dupe detection system would take advantage of some kind of "signature" or dimensionality reduction that permits a level of fuzzy matching. The implementations vary greatly depending on the domain; for example, near-duplicate image detection is a heavily researched field, as is near-duplicate text detection.
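
To make the "signature" idea a bit more concrete, here is a minimal sketch of one well-known text signature, SimHash. This is only an illustration under my own assumptions, not a recommendation of a specific method: the function names, the 3-word shingles, and the use of MD5 as the feature hash are all choices made for the example. Texts that share most of their shingles tend to get fingerprints that differ in only a few bits, so a small Hamming distance suggests a near-duplicate.

```python
import hashlib
import re

BITS = 64  # fingerprint width; matches the 8 digest bytes used below


def shingles(text, k=3):
    """Return the set of k-word shingles of the lowercased, punctuation-free text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Too short for a full shingle: fall back to the whole text as one feature.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def simhash(text):
    """Compute a 64-bit SimHash fingerprint of the text."""
    counts = [0] * BITS
    for feature in shingles(text):
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two questions would then be treated as likely duplicates when hamming_distance(simhash(q1), simhash(q2)) falls under some threshold. The right threshold depends on the data and has to be tuned; very short texts in particular give noisy fingerprints.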

As I said, this topic is well beyond the scope of this mailing list. A bit of legwork should yield more papers than you can possibly read :)

JG

Yabo-Arber Xu wrote:
Hi JG,

Sorry for interrupting the ongoing topic, but I am quite interested in the
online dup detection method you mentioned. Could you please elaborate on it a
bit, or point me to some links that I can follow up on?

Best,
Arber


On Wed, Aug 19, 2009 at 1:51 AM, Jonathan Gray <[email protected]> wrote:

You didn't talk much about how you plan on doing dupe-detection of
questions, but there are some interesting ways to generate signatures which
could turn into your row keys; then you could actually do some kind of
online duplicate detection of already answered questions. That's beyond the
scope of this mailing list, however.
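
As a rough illustration of the signature-as-row-key idea, here is a small sketch. Everything in it is an assumption made for the example: the signature is just a SHA-1 over the sorted set of lowercased words, which is a deliberately coarse choice, and a plain Python dict stands in for the table. The names signature_key and AnsweredQuestions are hypothetical.

```python
import hashlib
import re


def signature_key(question):
    """Build a row key that ignores case, punctuation, word order and repeats."""
    words = re.findall(r"[a-z0-9]+", question.lower())
    canonical = " ".join(sorted(set(words)))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


class AnsweredQuestions:
    """In-memory stand-in for a table keyed by question signature."""

    def __init__(self):
        self._rows = {}

    def put(self, question, answer):
        self._rows[signature_key(question)] = answer

    def lookup(self, question):
        # A single key lookup answers "has this already been asked?"
        return self._rows.get(signature_key(question))
```

Because the check is a single lookup by key, it can run online as each new question arrives. A signature this simple only catches trivially reworded questions; fuzzier matching needs a fuzzier signature.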

