I'm trying to extract names, ages, and genders from electoral rolls so that 
I can build a database of names and work out how specific names are 
distributed across locations and ages.

I began working with Mumbai's rolls, and am running into the following 
issues:

1) The electoral rolls are not in English but in Devanagari. This is not a 
major issue, because I can transliterate the names into English (Latin 
script) for comparison. I need the names in English so that I can use 
Soundex to collapse misspellings and spelling variants. I know of 
transliteration libraries that work with Devanagari (Hindi and Marathi). Is 
there anything similar for other scripts, such as Kannada and Tamil?

2) Although the rolls are in Devanagari, the text is not actually encoded in 
Unicode. It seems to rely on some other (non-Unicode) font encoding, so when 
I extract the text, it comes out as garbage. Since others have worked with 
the rolls before, is there a better way to get the text out?

3) If it's not possible to get the text out, can we use OCR? Which OCR 
library works best with Indic scripts?
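
For context on point 1, here is a minimal sketch of the Soundex step I have 
in mind once the names are already transliterated. It is plain Python with no 
external libraries. The function names `soundex` and `group_by_soundex` and 
the sample names are my own illustrations, not data from the rolls, and it 
implements standard American Soundex, which may well not be the best fit for 
Indian names.

```python
from collections import defaultdict

# Standard American Soundex digit groups. Vowels and Y carry no code.
_GROUPS = (
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
)
_CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def soundex(name: str) -> str:
    """Return the 4-character Soundex code of a Latin-script name."""
    # Keep only ASCII letters; anything else (spaces, dots, digits) is dropped.
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            # H and W do not separate two letters that share a code.
            continue
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        # A vowel resets prev, so a repeated code after it is kept.
        prev = code
        if len(result) == 4:
            break
    # Pad short codes with zeros.
    return (result + "000")[:4]


def group_by_soundex(names):
    """Group names that share a Soundex code (hypothetical helper)."""
    groups = defaultdict(list)
    for name in names:
        groups[soundex(name)].append(name)
    return dict(groups)


if __name__ == "__main__":
    # Illustrative names only.
    print(group_by_soundex(["Deshmukh", "Deshmuk", "Sharma", "Sarma", "Patil"]))
```

Names that land in the same group would then be treated as candidate 
spelling variants and checked by hand, since Soundex also merges names that 
are genuinely different.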

If anyone has some experience to share on these issues, it will be much 
appreciated.

-- 
Datameet is a community of Data Science enthusiasts in India. Know more about 
us by visiting http://datameet.org