Re: [datameet] Help with R logic - near similar name

Nikhil VJ Tue, 25 Aug 2020 23:50:12 -0700

Hi Ram,

I'm not sure about R, but if you have the list in an Excel or CSV file,
OpenRefine can help you iron it all out in a jiffy. Check out this article
I wrote, which explains the flow for this particular task:
http://datameet.org/2018/06/13/openrefine-bus-stop/


OpenRefine is a tool made for non-coders to clean up messy data. Site:
https://openrefine.org/
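
For a rough idea of what OpenRefine's key-collision clustering does, here is
a minimal Python sketch of a "fingerprint" key, run on the names from
Example 2. It is an approximation written for illustration, not OpenRefine's
actual code, and the function names fingerprint and cluster are my own.

```python
import re
from collections import defaultdict


def fingerprint(name):
    """Lowercase, turn punctuation into spaces, then sort the unique tokens."""
    tokens = re.sub(r"[^a-z0-9]+", " ", name.lower()).split()
    return " ".join(sorted(set(tokens)))


def cluster(names):
    """Group names that share the same fingerprint key."""
    groups = defaultdict(list)
    for name in names:
        groups[fingerprint(name)].append(name)
    return dict(groups)


# Names taken from Example 2 in the original question.
names = [
    "ASHIRWAD HEART HOSPITAL ( GHATKOPAR )",
    "Ashirwad Heart Hospital",
    "ASHIRWAD HEART HOSPITAL ( GHATKOPAR )",
    "Ashirwad Heart Hospita-Ghatkopar",
]

clusters = cluster(names)
for key, members in clusters.items():
    print(key, "->", members)
```

This only merges names that differ in case, punctuation, spacing or word
order. The entry with the typo ("Hospita") and the one without the locality
end up under their own keys, so those still need a fuzzier method (OpenRefine
also offers nearest-neighbour clustering) or a manual check.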

--
Cheers,
Nikhil VJ
https://nikhilvj.co.in


On Wed, Aug 26, 2020 at 6:21 AM [email protected] <[email protected]>
wrote:

> Hi Ram
>
> In addition to the helpful suggestions made above, here are some
> R-specific pointers:
> - stringr is a very helpful package for most of the string manipulation
> steps (whitespace removal, tokenisation, regex matching) recommended
> above.
> - You may also need a package that computes 'distances' between the
> strings you are comparing. stringdist is one such package. However, with
> Indian names, I found some of the phonetic algorithms (rogerroot,
> soundex) in the phonics package much more helpful.
>
> Hope this helps! Good luck!
> Madhu
>
> On Wednesday, 26 August 2020 at 00:48:45 UTC+5:30 [email protected]
> wrote:
>
>> Hi Ram,
>>
>> Faced with similar issues, the following worked for me:
>>
>> 1. Convert everything to lower or upper case using tolower / toupper
>> 2. Use grep to match the common pattern in the names
>>
>> Best,
>> Sudatta
>>
>> On Aug 25, 2020, at 7:52 AM, Rahul Gupta <[email protected]> wrote:
>>
>> Hi Ram,
>>
>> I'm not sure whether R has something very similar to FuzzyWuzzy (Python),
>> but you can try agrep:
>> https://astrostatistics.psu.edu/su07/R/html/base/html/agrep.html
>>
>> It does a similar kind of approximate string matching. You can set your own
>> threshold criteria and filter the data accordingly.
>>
>> On Tue, 25 Aug, 2020, 8:09 pm [email protected], <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> I have collected hospital data from multiple sources. However, each
>>> source uses a different name for the same hospital. I am trying to get a
>>> clean list with no duplicates. I am using R and couldn't resolve this
>>> with stringdist_join. I would appreciate any suggestions on an approach.
>>>
>>> For example, a hospital in Guntur (A.P.) is listed under the following
>>> names. Can we mark (or eliminate) the duplicates?
>>>
>>> Example 1
>>> SANKARA EYE HOSPITAL(GUNTUR)
>>> SANKARA EYE HOSPITAL
>>> SANKARA EYE HOSPITAL ( A UNIT OF SRI KANCHI KAMA KOTI MEDICAL TRUST)
>>>
>>>
>>> Example 2
>>> ASHIRWAD HEART HOSPITAL ( GHATKOPAR )
>>> Ashirwad Heart Hospital
>>> ASHIRWAD HEART HOSPITAL ( GHATKOPAR )
>>> Ashirwad Heart Hospita-Ghatkopar
>>>
>>> Thanks
>>> Ram
>>>
>>> --
>>> Datameet is a community of Data Science enthusiasts in India. Know more
>>> about us by visiting http://datameet.org
>>> ---
>>> You received this message because you are subscribed to the Google
>>> Groups "datameet" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/datameet/19ee8101-84ec-42b0-974a-43035b5902f1n%40googlegroups.com
>>> <https://groups.google.com/d/msgid/datameet/19ee8101-84ec-42b0-974a-43035b5902f1n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
>

