Hi Ram,
For one project I had to match village names in one dataset against another
dataset containing ~44000 villages in Maharashtra, so I faced a similar
situation. To find the exact (or closest) match, I applied the following
steps to both strings being compared:
1. Remove whitespace.
2. Convert everything to lowercase.
3. Compare the strings for an exact match. If not found, go to the next step.
4. Remove everything inside parentheses, including the parentheses themselves.
5. Compare the strings for an exact match. If not found, go to the next step.
6. Remove characters like ! - _ . etc.
7. Compare the strings for an exact match. If not found, go to the next step.
8. Compare the two strings for Levenshtein distance = 1, then for distance = 2,
and so on. (The greater the distance, the less reliable the match.)
Levenshtein distance: reference 1
<https://people.cs.pitt.edu/~kirk/cs1501/Pruhs/Spring2006/assignments/editdistance/Levenshtein%20Distance.htm>,
Wikipedia <https://en.wikipedia.org/wiki/Levenshtein_distance>
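
A minimal sketch of the eight steps above, written in Python rather than R and
using only the standard library. The function names (levenshtein,
normalize_basic, strip_parens, strip_punct, find_match) and the max_distance
cutoff are illustrative assumptions, not code from the original project.

```python
import re
from typing import Iterable, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def normalize_basic(s: str) -> str:
    """Steps 1-2: remove whitespace and lowercase."""
    return re.sub(r"\s+", "", s).lower()


def strip_parens(s: str) -> str:
    """Step 4: remove parenthesised text, including the parentheses."""
    return re.sub(r"\([^)]*\)", "", s)


def strip_punct(s: str) -> str:
    """Step 6: remove punctuation such as ! - _ . (any non-alphanumeric)."""
    return re.sub(r"[\W_]", "", s)


def find_match(name: str, candidates: Iterable[str],
               max_distance: int = 2) -> Optional[Tuple[str, int]]:
    """Return (candidate, distance) for the first stage that matches, else None."""
    candidates = list(candidates)
    stages = [
        normalize_basic,                                          # steps 1-3
        lambda s: strip_parens(normalize_basic(s)),               # steps 4-5
        lambda s: strip_punct(strip_parens(normalize_basic(s))),  # steps 6-7
    ]
    for stage in stages:
        key = stage(name)
        for candidate in candidates:
            if stage(candidate) == key:
                return candidate, 0

    # Step 8: fall back to the closest candidate within max_distance edits.
    key = stages[-1](name)
    best = None
    for candidate in candidates:
        distance = levenshtein(key, stages[-1](candidate))
        if distance <= max_distance and (best is None or distance < best[1]):
            best = (candidate, distance)
    return best


if __name__ == "__main__":
    hospitals = ["SANKARA EYE HOSPITAL(GUNTUR)",
                 "ASHIRWAD HEART HOSPITAL ( GHATKOPAR )"]
    print(find_match("Sankara Eye Hospital", hospitals))
    print(find_match("Ashirwad Heart Hospita", hospitals))
```

Keeping max_distance small limits false matches. A name such as
"Ashirwad Heart Hospita-Ghatkopar" would not match under this sketch, because
the locality survives outside the parentheses and the edit distance becomes
large.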
Reference 1 mentions the function 'adist' in R. I haven't used R, so I don't
know much about the readily available functions and packages, but a quick
search showed me a few more functions like adist.
This might help.
Best, Ravikant.
On Tue, Aug 25, 2020 at 8:46 PM Rahul Gupta <[email protected]>
wrote:
> Hi Ram,
>
> Not sure if there is something very similar to FuzzyWuzzy (Python) in R.
> But you can try this link
> https://astrostatistics.psu.edu/su07/R/html/base/html/agrep.html
>
> It does a similar kind of approximate string matching. You can set your own
> threshold criteria and filter the data accordingly.
>
> On Tue, 25 Aug, 2020, 8:09 pm [email protected], <
> [email protected]> wrote:
>
>> Hi,
>>
>> I have collected hospital data from multiple sources. However, each
>> source uses a different name for the same hospital. I am trying to clean
>> the list so that it has no duplicates. I am using R and couldn't resolve
>> this with stringdist_join. I would appreciate suggestions on an approach.
>>
>> For example, a hospital in Guntur (A.P) is listed under the following
>> names. Can we mark (or eliminate) the duplicates?
>>
>> Example 1
>> SANKARA EYE HOSPITAL(GUNTUR)
>> SANKARA EYE HOSPITAL
>> SANKARA EYE HOSPITAL ( A UNIT OF SRI KANCHI KAMA KOTI MEDICAL TRUST)
>>
>>
>> Example 2
>> ASHIRWAD HEART HOSPITAL ( GHATKOPAR )
>> Ashirwad Heart Hospital
>> ASHIRWAD HEART HOSPITAL ( GHATKOPAR )
>> Ashirwad Heart Hospita-Ghatkopar
>>
>> Thanks
>> Ram
>>
>> --
>> Datameet is a community of Data Science enthusiasts in India. Know more
>> about us by visiting http://datameet.org
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "datameet" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/datameet/19ee8101-84ec-42b0-974a-43035b5902f1n%40googlegroups.com
>> <https://groups.google.com/d/msgid/datameet/19ee8101-84ec-42b0-974a-43035b5902f1n%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>