I'm trying to extract names, ages and genders from electoral rolls, so that I can build a database of names and work out how specific names are spread across locations and ages.
I began working with Mumbai's rolls, and am running into the following issues:

1) The electoral rolls are not in English, but in Devanagari. This is not a major issue, because I could transliterate the names into English for comparison (I need the names to be in English so that I can use Soundex to catch misspellings etc.). I know of transliteration libraries that work with Devanagari (Hindi and Marathi). Is there anything similar for other scripts such as Kannada and Tamil?

2) While the rolls are in Devanagari, the text is not actually in Unicode. It uses some other, non-Unicode font, and hence when I extract the text, it comes out as garbage. Since others have worked with the rolls before, is there a better way to get the text out?

3) If it's not possible to get the text out, can we use OCR? Which OCR library is best at working with Indic scripts?

If anyone has some experience to share on these issues, it will be much appreciated.

--
Datameet is a community of Data Science enthusiasts in India. Know more about us by visiting http://datameet.org
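To make the Soundex step from point 1 concrete, here is a minimal sketch in Python of the classic American Soundex code, applied to names that have already been transliterated into Latin script. The function name `soundex` and the sample spellings are assumptions for illustration only; they do not come from the rolls or from any existing library. Soundex was designed for English surnames, so how well it groups Indian names would need to be checked against real data.

```python
# Classic American Soundex: first letter plus three digits, zero-padded.
# Hypothetical helper for illustration; not taken from an existing library.
_GROUPS = {
    "1": "bfpv",
    "2": "cgjkqsxz",
    "3": "dt",
    "4": "l",
    "5": "mn",
    "6": "r",
}
_CODES = {ch: digit for digit, letters in _GROUPS.items() for ch in letters}


def soundex(name: str) -> str:
    """Return the 4-character Soundex code of a Latin-script name."""
    letters = [c for c in name.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            # h and w are skipped and do not separate equal codes
            continue
        code = _CODES.get(c, "")
        if code and code != prev:
            result += code
        # vowels have no code, so they reset prev and allow a repeat
        prev = code
        if len(result) == 4:
            break
    return result.ljust(4, "0")


if __name__ == "__main__":
    # Two transliterated spellings of the same name (assumed examples)
    print(soundex("Sunita"), soundex("Suneeta"))  # S530 S530
    print(soundex("Robert"), soundex("Rupert"))   # R163 R163
```

Names whose codes match can then be grouped as likely spelling variants before counting them by location and age.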
