I'd start with the more_like_this query and see how far that takes you.
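
As a rough sketch of what that could look like: the snippet below builds a search body that pairs a more_like_this clause on the free-text attributes with boosted exact-match clauses on the identifier-like ones, skipping any attribute that is empty. The index name `people`, the field names, and the boost values are all assumptions made up for illustration, and the `like` / `doc` form of more_like_this comes from later Elasticsearch releases, so check it against the version you run.

```python
import json

# Hypothetical field names and weights; adjust to the real mapping and data.
FUZZY_FIELDS = ["first_name", "last_name", "address"]
EXACT_FIELD_BOOSTS = {"ssn": 10.0, "email": 5.0, "cell_phone": 3.0, "home_phone": 2.0}


def build_candidate_query(record, index="people"):
    """Build a search body that looks for likely duplicates of `record`."""
    # Drop empty values so a missing attribute adds no clause at all.
    fuzzy_doc = {f: record[f] for f in FUZZY_FIELDS if record.get(f)}

    should = []
    if fuzzy_doc:
        should.append({
            "more_like_this": {
                "fields": list(fuzzy_doc),
                # An "artificial document": match against these values directly.
                "like": [{"_index": index, "doc": fuzzy_doc}],
                # The defaults (2 and 5) would discard most terms of a short record.
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        })

    # Exact matches on identifier-like fields carry a higher, per-field weight.
    for field, boost in EXACT_FIELD_BOOSTS.items():
        value = record.get(field)
        if value:
            should.append({"term": {field: {"value": value, "boost": boost}}})

    if not should:
        raise ValueError("record has no usable attributes to match on")

    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}


if __name__ == "__main__":
    sample = {"first_name": "Bob", "last_name": "Smith", "email": "", "ssn": "123"}
    print(json.dumps(build_candidate_query(sample), indent=2))
```

The top hits for each record would then be candidate duplicates to review or score further, not confirmed matches.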

clint


On 17 March 2014 18:28, Shrin King <[email protected]> wrote:

> We have a new, large department that was formed by merging three
> departments. A few employees worked for two or three of those departments
> before the merger, so one person's attributes may be recorded in several
> departments' databases.
> An additional problem is that the same person can appear under different
> first names or nicknames.
>
> The attributes of a person include:
> first name, last name, email, home phone, cell phone, SSN, address, etc.
>
> Because any of these values can be empty, no field can serve as a unique
> primary key.
> We therefore need a smarter way to classify records as matches, with
> weights assigned to the different matching rules.
>
> Any tips for handling this kind of deduplication task? Are there any
> open-source tools we could use?
>
>
> The database contains about 100 million records.
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/68242e72-4aff-41a9-8a45-dc726e89aab8%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>

