Hi Ram,

I'm not sure about R, but if you have the list as an Excel or CSV file, OpenRefine can help you iron it all out in a jiffy. I've written an article that explains the flow for this kind of task: http://datameet.org/2018/06/13/openrefine-bus-stop/
OpenRefine is a tool made for non-coders to clean up messy data. Site: https://openrefine.org/

--
Cheers,
Nikhil VJ
https://nikhilvj.co.in

On Wed, Aug 26, 2020 at 6:21 AM [email protected] <[email protected]> wrote:

> Hi Ram
>
> In addition to the helpful suggestions made above, here are some
> R-specific pointers:
> - stringr is an extremely helpful package for most of the string
> manipulation actions (whitespace removal, tokenisation, regex
> matching) recommended above.
> - you may also need a package that helps you compute ‘distances’ between
> the strings you are comparing. stringdist is one such package. However,
> with Indian names, I found some of the phonetic distance algorithms
> (rogerroot, soundex) in the phonics package much more helpful.
>
> Hope this helps! Good luck!
> Madhu
>
> On Wednesday, 26 August 2020 at 00:48:45 UTC+5:30 [email protected]
> wrote:
>
>> Hi Ram,
>>
>> Faced with similar issues, the following worked for me:
>>
>> 1. Make everything lower or upper case using tolower/toupper
>> 2. Use grep to match the common pattern in the names
>>
>> Best,
>> Sudatta
>>
>> On Aug 25, 2020, at 7:52 AM, Rahul Gupta <[email protected]> wrote:
>>
>> Hi Ram,
>>
>> I'm not sure if there is something very similar to FuzzyWuzzy (Python) in R,
>> but you can try agrep:
>> https://astrostatistics.psu.edu/su07/R/html/base/html/agrep.html
>>
>> It does a similar kind of approximate string matching. You can set your own
>> threshold criteria and filter the data accordingly.
>>
>> On Tue, 25 Aug, 2020, 8:09 pm [email protected], <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> I have collected hospital data from multiple sources. However, each
>>> source uses a different name for the same hospital. I am trying to
>>> produce a clean list with no duplicates. I am using R and couldn't
>>> resolve this with stringdist_join. I would appreciate any suggested
>>> approach.
>>>
>>> For example, one hospital in Guntur (A.P.) is listed under the following
>>> names. Can we mark (or eliminate) the duplicates?
>>>
>>> Example 1
>>> SANKARA EYE HOSPITAL(GUNTUR)
>>> SANKARA EYE HOSPITAL
>>> SANKARA EYE HOSPITAL ( A UNIT OF SRI KANCHI KAMA KOTI MEDICAL TRUST)
>>>
>>> Example 2
>>> ASHIRWAD HEART HOSPITAL ( GHATKOPAR )
>>> Ashirwad Heart Hospital
>>> ASHIRWAD HEART HOSPITAL ( GHATKOPAR )
>>> Ashirwad Heart Hospita-Ghatkopar
>>>
>>> Thanks
>>> Ram
>>>
>>> --
>>> Datameet is a community of Data Science enthusiasts in India. Know more
>>> about us by visiting http://datameet.org
>>> ---
>>> You received this message because you are subscribed to the Google
>>> Groups "datameet" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/datameet/19ee8101-84ec-42b0-974a-43035b5902f1n%40googlegroups.com
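
For anyone who prefers code to OpenRefine, the suggestions in this thread (fold the case, strip the bracketed qualifiers, then do approximate matching against a threshold) could be sketched roughly as below. Treat it as a sketch under assumptions, not a recommended method: it is Python with only the standard library rather than R, the names normalise, similarity and group_names are made up for illustration, the 0.75 threshold is an arbitrary starting value, and difflib's ratio simply stands in for the stringdist / agrep / FuzzyWuzzy scores mentioned above.

```python
import re
from difflib import SequenceMatcher

# The two examples from the original question, copied as-is.
EXAMPLES = [
    "SANKARA EYE HOSPITAL(GUNTUR)",
    "SANKARA EYE HOSPITAL",
    "SANKARA EYE HOSPITAL ( A UNIT OF SRI KANCHI KAMA KOTI MEDICAL TRUST)",
    "ASHIRWAD HEART HOSPITAL ( GHATKOPAR )",
    "Ashirwad Heart Hospital",
    "ASHIRWAD HEART HOSPITAL ( GHATKOPAR )",
    "Ashirwad Heart Hospita-Ghatkopar",
]


def normalise(name):
    """Hypothetical helper: lower-case, drop bracketed text, tidy whitespace."""
    name = name.lower()
    name = re.sub(r"\([^)]*\)", " ", name)   # drop "(guntur)", "( a unit of ... )"
    name = re.sub(r"[^a-z0-9]+", " ", name)  # hyphens and punctuation become spaces
    return " ".join(name.split())            # collapse repeated whitespace


def similarity(a, b):
    """Similarity in [0, 1] from difflib (a stand-in for other string distances)."""
    return SequenceMatcher(None, a, b).ratio()


def group_names(names, threshold=0.75):
    """Greedy grouping: a name joins the first group whose key is similar enough.

    The threshold is an assumption and needs tuning on real data.
    Returns a list of (normalised key, [original names]) pairs.
    """
    groups = []
    for name in names:
        key = normalise(name)
        for rep, members in groups:
            if similarity(key, rep) >= threshold:
                members.append(name)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append((key, [name]))
    return groups
```

Calling group_names(EXAMPLES) and printing each key with its members gives a list you can review by eye before deleting anything. One caveat: stripping the bracketed text also throws away the location, so two different branches with the same name would be merged; keeping the city as a separate column and only comparing names within one city would be safer. In R the same steps should map onto tolower, stringr and stringdist.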
