Arber,
I don't have any links to papers handy, unfortunately. Quite honestly,
there is a TON of research on this subject. My recommendation is to dig
around the ACM Digital Library, where you can find many papers on
duplicate detection. If you don't have access to the ACM archives,
digging around Google should still yield some results.
Generally, an online dupe detection system takes advantage of some
kind of "signature" or dimensionality reduction that permits a level of
fuzzy matching. The implementations vary greatly depending on the
domain; for example, near-duplicate image detection is a heavily
researched field, as is text-based near-duplicate detection.
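To make the "signature" idea a bit more concrete, here is a minimal
sketch of one well-known text technique, a SimHash-style fingerprint.
This is only an illustration and not a recommendation for your domain.
The function names (shingles, simhash, hamming), the 3-word shingle size,
and the 64-bit width are all assumptions picked for the example.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of the text, ignoring case and punctuation."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Too short to shingle: fall back to the whole text as one shingle.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def simhash(text, bits=64):
    """Reduce a text to a `bits`-wide fingerprint (bits <= 64 in this sketch).

    Each shingle votes +1 or -1 on every bit position, so texts that
    share most shingles tend to agree on most fingerprint bits.
    """
    counts = [0] * bits
    for sh in shingles(text):
        # Take the first 8 bytes of MD5 as a stable 64-bit hash of the shingle.
        h = int.from_bytes(hashlib.md5(sh.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two texts would then be treated as near-duplicates when the Hamming
distance between their fingerprints falls under some threshold. That
threshold is something you would have to tune against your own data.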
As I said, this topic is well beyond the scope of this mailing list. A
bit of legwork should yield more papers than you can possibly read :)
JG
Yabo-Arber Xu wrote:
Hi JG,
Sorry for interrupting the ongoing topic, but I am quite interested in the
online dup detection method you mentioned. Could you please elaborate on it
a bit, or point me to some links that I can follow up on?
Best,
Arber
On Wed, Aug 19, 2009 at 1:51 AM, Jonathan Gray <[email protected]> wrote:
You didn't talk much about how you plan on doing dupe-detection of
questions, but there are some interesting ways to generate signatures that
could serve as your row keys. With those, you could actually do some kind of
online duplicate detection of already-answered questions. That's beyond the
scope of this mailing list, however.
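As a rough illustration of the signature-as-row-key idea, here is a
minimal sketch. Everything in it is hypothetical: the signature() scheme
(SHA-1 over the sorted, lowercased, de-duplicated words), the
QuestionStore class, and the plain dict standing in for a row-keyed
table. It only catches questions that match after normalization; fuzzier
matching would need a different signature.

```python
import hashlib
import re


def signature(question):
    """Hypothetical signature: SHA-1 of the sorted set of lowercased words.

    Case, punctuation, word order, and repeated words do not affect it.
    """
    words = sorted(set(re.findall(r"[a-z0-9]+", question.lower())))
    return hashlib.sha1(" ".join(words).encode("utf-8")).hexdigest()


class QuestionStore:
    """In-memory stand-in for a table whose row key is the question signature."""

    def __init__(self):
        self.rows = {}  # row key -> first question stored under that key

    def put_if_new(self, question):
        """Store the question unless its row key exists; return (key, was_new)."""
        key = signature(question)
        if key in self.rows:
            return key, False  # duplicate detected at write time
        self.rows[key] = question
        return key, True


store = QuestionStore()
first = store.put_if_new("How do I scan a table?")
repeat = store.put_if_new("how do I scan a table")
other = store.put_if_new("How do I delete a row?")
```

Because the signature is the key, the duplicate check is a single key
lookup at write time rather than a scan over existing questions.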