I'd start with the more_like_this query and see how far that takes you.

clint
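To make the suggestion concrete, here is a minimal sketch of what such a request body might look like, built in Python from one person record. The field names (`first_name`, `last_name`, `email`, `address`) and the helper `build_mlt_query` are hypothetical and would need to match the real mapping. It uses the `like` parameter of current Elasticsearch releases; the 1.x releases from the time of this thread used `like_text` instead. Empty fields are skipped, since the original question notes that any attribute may be missing.

```python
import json

# Hypothetical text fields; adjust to the actual index mapping.
TEXT_FIELDS = ["first_name", "last_name", "email", "address"]


def build_mlt_query(record, fields=TEXT_FIELDS):
    """Build a more_like_this search body from the non-empty fields of a record."""
    # Only query fields that actually have a value in this record.
    present = [f for f in fields if record.get(f)]
    like_text = " ".join(str(record[f]) for f in present)
    return {
        "query": {
            "more_like_this": {
                "fields": present,
                "like": like_text,
                # Records are short, so lower the frequency thresholds
                # from their defaults to keep rare terms in play.
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        }
    }


record = {
    "first_name": "Bob",
    "last_name": "Smith",
    "email": "bob.smith@example.com",
    "home_phone": "",
    "address": None,
}

body = build_mlt_query(record)
print(json.dumps(body, indent=2))
```

The resulting body would be sent to the index's `_search` endpoint. The hits are only candidate duplicates; deciding which candidates are really the same person still needs separate matching rules.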
On 17 March 2014 18:28, Shrin King <[email protected]> wrote:

> Given a new big department merged from three departments. A few employees
> worked for two or three departments before merging. That means, the
> attributes of one person might be listed under different departments'
> databases.
> One additional problem is that one person can have different first names
> or nick names.
>
> These attributes of a person include
> first name, last name, email, home phone, cell phone, ssn, address, etc ...
>
> Because some values of the above could be empty, there is no unique
> primary key.
> Hence, we need an intelligent solution for the classification, and to put
> weights for different matching rules.
>
> Any tips to handle such deduplication tasks? Any open-source tools
> available to use?
>
> The database contains about 100 million records.
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/68242e72-4aff-41a9-8a45-dc726e89aab8%40googlegroups.com
> For more options, visit https://groups.google.com/d/optout.
