Thank you Dilawar, Rahul, Ravikant, Sudatta, Madhu, and Nikhil. I mixed and matched the options you suggested, and I now have a list of 18k hospitals in India. I will be publishing this data at http://india-data.com/ , where people can search for information by pincode. A beta version is live at http://india-data.com/pincode/221107/ .
Thanks again to all.

Regards
Ram

On Wednesday, 26 August 2020 at 02:50:09 UTC-4 [email protected] wrote:

> Hi Ram,
>
> I'm not sure about R, but if you have the list in an Excel / CSV file then
> OpenRefine can help you iron it all out in a jiffy. Check out this article
> I've written that explains the flow for this particular task:
> http://datameet.org/2018/06/13/openrefine-bus-stop/
>
> OpenRefine is a tool made for non-coders to clean up messy data. Site:
> https://openrefine.org/
>
> --
> Cheers,
> Nikhil VJ
> https://nikhilvj.co.in
>
> On Wed, Aug 26, 2020 at 6:21 AM [email protected] <[email protected]>
> wrote:
>
>> Hi Ram
>>
>> In addition to the helpful suggestions made above, here are some
>> R-specific pointers:
>> - stringr is an extremely helpful package for most of the string
>>   manipulation actions (whitespace removal, tokenisation, regex
>>   matching) recommended above.
>> - you may also need a package that helps you compute ‘distances’ between
>>   the strings you are comparing. stringdist is one such package. However,
>>   with Indian names, I found some of the phonetic distance algorithms
>>   (rogerroot, soundex) in the phonics package much more helpful.
>>
>> Hope this helps! Good luck!
>> Madhu
>>
>> On Wednesday, 26 August 2020 at 00:48:45 UTC+5:30 [email protected]
>> wrote:
>>
>>> Hi Ram,
>>>
>>> Faced with similar issues, the following worked for me:
>>>
>>> 1. Make everything lower or upper case using tolower / toupper
>>> 2. Use grep to match the common pattern in the names
>>>
>>> Best,
>>> Sudatta
>>>
>>> On Aug 25, 2020, at 7:52 AM, Rahul Gupta <[email protected]> wrote:
>>>
>>> Hi Ram,
>>>
>>> Not sure if there is something very similar to FuzzyWuzzy (Python) in R.
>>> But you can try this link:
>>> https://astrostatistics.psu.edu/su07/R/html/base/html/agrep.html
>>>
>>> It is a similar kind of approximate string matching. You can set your own
>>> threshold criteria and filter the data accordingly.
>>>
>>> On Tue, 25 Aug, 2020, 8:09 pm [email protected], <
>>> [email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have collected hospital data from multiple sources. However, each
>>>> source uses a different name. I am trying to produce a clean list with
>>>> no duplicates. I am using R and couldn't resolve this with
>>>> stringdist_join. I would appreciate it if you could suggest an approach.
>>>>
>>>> For example, Guntur (A.P) is listed with the following names. Can we
>>>> mark (or eliminate) the duplicates?
>>>>
>>>> Example 1
>>>> SANKARA EYE HOSPITAL(GUNTUR)
>>>> SANKARA EYE HOSPITAL
>>>> SANKARA EYE HOSPITAL ( A UNIT OF SRI KANCHI KAMA KOTI MEDICAL TRUST)
>>>>
>>>> Example 2
>>>> ASHIRWAD HEART HOSPITAL ( GHATKOPAR )
>>>> Ashirwad Heart Hospital
>>>> ASHIRWAD HEART HOSPITAL ( GHATKOPAR )
>>>> Ashirwad Heart Hospita-Ghatkopar
>>>>
>>>> Thanks
>>>> Ram
>>>>
>>>> --
>>>> Datameet is a community of Data Science enthusiasts in India. Know more
>>>> about us by visiting http://datameet.org
>>>> ---
>>>> You received this message because you are subscribed to the Google
>>>> Groups "datameet" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/datameet/19ee8101-84ec-42b0-974a-43035b5902f1n%40googlegroups.com

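The steps suggested in this thread (normalise the case, strip the bracketed location or trust name, then compare what is left with an approximate matcher and a threshold) can be sketched roughly as below. This is only an illustration, not the code Ram actually ran: it is in Python rather than R, it uses `difflib.SequenceMatcher` from the standard library as a stand-in for stringdist / agrep / FuzzyWuzzy, and the names `normalise`, `similarity`, `mark_duplicates` and the 0.75 threshold are assumptions made up for the example. The threshold in particular would need tuning on the real data.

```python
import re
from difflib import SequenceMatcher


def normalise(name: str) -> str:
    """Lower-case a name, drop bracketed qualifiers, and collapse punctuation."""
    name = name.lower()
    name = re.sub(r"\([^)]*\)", " ", name)   # drop "(GUNTUR)", "( GHATKOPAR )", ...
    name = re.sub(r"[^a-z0-9]+", " ", name)  # hyphens and punctuation become spaces
    return " ".join(name.split())            # collapse repeated whitespace


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two names after normalisation."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def mark_duplicates(names, threshold=0.75):
    """Pair every name with a representative (hypothetical helper).

    Greedy single pass: a name is attached to the first earlier
    representative whose similarity reaches the threshold; otherwise it
    becomes a new representative itself.
    """
    representatives = []
    marked = []
    for name in names:
        match = next(
            (rep for rep in representatives if similarity(name, rep) >= threshold),
            None,
        )
        if match is None:
            representatives.append(name)
            match = name
        marked.append((name, match))
    return marked


if __name__ == "__main__":
    sample = [
        "SANKARA EYE HOSPITAL(GUNTUR)",
        "SANKARA EYE HOSPITAL",
        "ASHIRWAD HEART HOSPITAL ( GHATKOPAR )",
        "Ashirwad Heart Hospita-Ghatkopar",
    ]
    for original, representative in mark_duplicates(sample):
        print(f"{original!r} -> {representative!r}")
```

Keeping the original spelling next to its representative marks duplicates without deleting anything, so borderline matches can still be reviewed by hand. With 18k names, comparing every name against every representative may be slow; restricting comparisons to names sharing a pincode or city would be one way to cut that down.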