Hi Pradeep,

If you have all the words in one column of an Excel sheet, then the
*OpenRefine* tool can help you "iron out" the differences. It shows you
clusters of similar-looking cells, and for each cluster you decide which
value to go with (you can even type in a new standardised value if all the
options are wrong). It then overwrites all those cells with the one
standardised value. The rest of your data remains intact. No sorting or
filtering is needed.
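
To make the "overwrite with one standardised value" step concrete, here is a
rough Python sketch of the idea. It is only an illustration and not
OpenRefine's own code; the function name `standardise` and the sample column
are made up for this example.

```python
def standardise(cells, cluster, chosen):
    """Return a copy of `cells` where every member of `cluster` becomes `chosen`.

    Cells that are not in the cluster are left exactly as they were.
    """
    members = set(cluster)
    return [chosen if cell in members else cell for cell in cells]


# Hypothetical column of names, using spellings from this thread.
column = ["Thakkar", "Rathod", "Thakkkar", "Ramesh", "Rathor"]

# The user reviews the cluster and picks the value to keep.
cleaned = standardise(column, ["Thakkar", "Thakkkar"], "Thakkar")
print(cleaned)
```

Only the two cells in the chosen cluster are touched; the row order and every
other cell stay as they were.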

You can read a basic walkthrough for this specific use case here:
http://datameet.org/2018/06/13/openrefine-bus-stop/

It uses multiple algorithms to detect similar words, much like what search
engines and dictionaries do when you make a typo. You can change the
algorithm options and run new scans to catch the hard-to-find ones. If a
suggested cluster is a false positive, you can just ignore it and no changes
will be made to those values.
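
For a feel of how one such scan could work, here is a minimal Python sketch
of edit-distance clustering. It is an assumption-laden illustration, not
OpenRefine's implementation: the function names, the `max_distance`
threshold and the first-letter grouping (borrowed from the suggestion quoted
further down this thread) are all my own choices for the example.

```python
from collections import defaultdict


def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions
    needed to turn string `a` into string `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cluster_names(names, max_distance=2):
    """Group names that share a first letter and lie within `max_distance`
    edits of a name already in the group. Returns only groups of 2 or more."""
    # Reduce the search space: only compare names with the same first letter.
    blocks = defaultdict(list)
    for name in sorted(set(n for n in names if n)):
        blocks[name[0].lower()].append(name)

    clusters = []
    for block in blocks.values():
        block_clusters = []
        for name in block:
            for cluster in block_clusters:
                if any(levenshtein(name.lower(), member.lower()) <= max_distance
                       for member in cluster):
                    cluster.append(name)
                    break
            else:
                block_clusters.append([name])
        clusters.extend(c for c in block_clusters if len(c) > 1)
    return clusters


names = ["Pradeep", "Pradip", "Thakkkar", "Thakkar", "Rathod", "Rathor",
         "Swetha", "Sweta", "bhen", "ben", "Sumandev", "Sumandeb",
         "Ramesh", "Rajesh"]

for cluster in cluster_names(names):
    print(cluster)
```

Note that plain edit distance also groups Ramesh with Rajesh, since they
differ by a single letter. That is exactly the kind of false positive you
would skip when reviewing the clusters. Lowering `max_distance` to 1 does not
help here: it still pairs Ramesh with Rajesh but then misses Pradeep and
Pradip, which are two edits apart.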


--
Cheers,
Nikhil VJ
+91-966-583-1250
Pune, India
Website <http://nikhilvj.co.in>
DataMeet Pune chapter <https://datameet-pune.github.io/>
Self-designed learner at Swaraj University <http://www.swarajuniversity.org>
Payment / Contribute <https://nikhilvj.benow.in/pay>

On Tue, Aug 14, 2018 at 8:07 AM, Venkata Pingali <[email protected]> wrote:

> Soundex is not enough. We went through Metaphone and
> Double Metaphone as well. The last showed the best
> performance when combined with simple ways to reduce
> the search space (e.g., only comparing names that start
> with the same letter).
>
> But it still had too many false positives and negatives. We ended up
> using a much simpler approach of manually labeling the top N most
> frequent names.
>
>
>
> On Tue, Aug 14, 2018 at 7:58 AM, Pradeep Bhatt <[email protected]>
> wrote:
>
>> Hi All,
>>
>> What is the best way to know if two words are phonetically similar?
>>
>> e.g. *Some similar words*
>>
>> Pradeep - Pradip
>> Thakkkar - Thakkar
>> Rathod - Rathor
>> Swetha - Sweta
>> bhen - ben
>> Sumandev - Sumandeb
>>
>> *Non-similar*
>> Ramesh - Rajesh
>>
>> This is needed for spelling mistakes introduced when transliterating from
>> Indian languages to English.
>>
>> Does Soundex work well for Indian names?
>>
>> Regards,
>> Pradeep
>>
>>
>>
>> --
>> Datameet is a community of Data Science enthusiasts in India. Know more
>> about us by visiting http://datameet.org
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "datameet" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

