Hi Devdatta,

I came across the legacy Devanagari fonts issue earlier, when I started working on budget data. The fonts are Shree-Dev, Kruti-Dev, Shivaji, etc.: legacy fonts from the era before Unicode Devanagari was widely supported. To get around that, they used simple substitution, like a = क. I've put up a graphic that shows this mapping for a few fonts: http://i.imgur.com/ICUC6Wk.png
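To make the substitution idea concrete, here is a minimal Python sketch. Only the a = क pair comes from the example above; every other entry in the table is made up for illustration, and the function name is hypothetical. Real fonts each need their own, much longer table, plus extra rules that this sketch ignores.

```python
# Hypothetical mapping for illustration only. Apart from "a" -> "क",
# these pairs are NOT taken from any real legacy font.
LEGACY_TO_UNICODE = {
    "a": "क",
    "b": "ख",
    "c": "ग",
    "A": "ा",    # a vowel sign (matra)
    "aa": "क्ष",  # multi-character key: must win over two single "a"s
}


def legacy_to_unicode(text, table=LEGACY_TO_UNICODE):
    """Replace legacy-font keystrokes with Unicode Devanagari.

    Keys are tried longest first, so that multi-character sequences are
    not broken up by shorter keys. Characters with no entry in the table
    are copied through unchanged.
    """
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No key matches at this position: keep the character as-is.
            out.append(text[i])
            i += 1
    return "".join(out)


print(legacy_to_unicode("aA b"))  # का ख
```

Trying the longest keys first is the one design choice that matters here: a plain character-by-character replace would turn "aa" into two separate क.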

I found a group named technical-hindi who have been working on simple JavaScript pages that convert these fonts to Unicode Devanagari (and back!). I used them; for the content I had, I needed to add a few extra conversions, and then it worked like a charm.

Their site, where many converters are shared: https://sites.google.com/site/technicalhindi/home/converters
Their Google group: https://groups.google.com/forum/#!forum/technical-hindi
The modified converters I used (they only cover my limited use cases): http://ourpuneourbudget.in/tools/

While studying these, I came upon an unexpected situation: if the document you are extracting data from is a PDF (which I also refer to as a "digital graveyard"), then it is PREFERABLE for the text to be in a legacy Devanagari font rather than a Unicode font! That's because as of today (or as of 2015, when I came across it), PDF technology doesn't handle Unicode Devanagari well. Some distortions are applied to make the glyphs "print" properly, and these permanently distort the original characters. The issue is described here: https://stackoverflow.com/questions/30756193/unable-to-copy-exact-hindi-content-from-pdf

So if the text in the PDFs you're working on is in legacy Devanagari instead of Unicode Devanagari, you're actually lucky :P. If it's in Unicode, then that PDF is a true digital graveyard :P. OCR can work, yes, but please tell me if you find a way to OCR a page table cell by table cell instead of jumbling everything up.

I had come across a project like yours a year ago, but I backed out because I could not get around this issue: the fonts in the PDF were in Unicode. Here's an issue I filed in the Tabula project related to this; they fixed it for legacy-font extraction at least:
https://github.com/tabulapdf/tabula/issues/303

--
Cheers,
Nikhil VJ
+91-966-583-1250
Pune / Mandangad, India
DataMeet Pune chapter <https://datameet-pune.github.io/>
Self-designed learner at Swaraj University <http://www.swarajuniversity.org>
Blog <http://nikhilsheth.blogspot.in>

On Sat, Aug 19, 2017 at 11:21 PM, Raphael Susewind <li...@raphael-susewind.de> wrote:

> Hi Devdatta,
>
> I had run into the same issue, and indeed the only workaround is OCR.
> It's not just a different encoding than Unicode - it's actually garbled
> CMaps, which is much worse (i.e. not recoverable).
>
> See my comments here for starters (and the badly written scripts):
>
> https://github.com/raphael-susewind/india-religion-politics/tree/master/maharolls2014
>
> As for Soundex, you might want to take a look at the IndicSoundex
> collection, which is more accurate than transliteration into Latin
> followed by English Soundex:
>
> http://libindic.org/Soundex
>
> Good news is that I have done the whole exercise for Maharashtra 2014,
> and may be able to share depending on what your project is about.
> Perhaps send me a PM and we can discuss further,
>
> Best,
> Raphael
>
> On 08/19/2017 06:14 PM, Devdatta Tengshe wrote:
> > I'm attempting to read Names, Ages & Genders from Electoral Rolls, so
> > that I can create a database of Names, to figure out the General Spread
> > of Specific Names across locations, and ages.
> >
> > I began working with Mumbai's rolls, and am running into the following
> > issues:
> >
> > 1) The Electoral Rolls are not in English, but in Devanagari. This is
> > not a Major issue, because I could transliterate it into English for
> > Comparison (I need the names to be in English, so that I can use Soundex
> > to remove misspellings etc). I know libraries for transliteration that
> > work with Devanagari (Hindi & Marathi). Is there anything similar for
> > other scripts such as Kannada & Tamil etc?
> >
> > 2) While the Rolls are in Devanagari, the text is not actually in
> > Unicode. It is in some other font, and hence when I Get the text out,
> > it's garbage. Since Others have worked with the rolls before, is there a
> > better way to get the Text Out?
> >
> > 3) If it's not possible to get the Text out, Can we use OCR? What OCR
> > library is best at working with Indic Scripts?
> >
> > If anyone has some experience to share on these issues, it will be much
> > appreciated.
> >
> > --
> > Datameet is a community of Data Science enthusiasts in India. Know more
> > about us by visiting http://datameet.org
> > ---
> > You received this message because you are subscribed to the Google
> > Groups "datameet" group.
> > To unsubscribe from this group and stop receiving emails from it, send
> > an email to datameet+unsubscr...@googlegroups.com
> > <mailto:datameet+unsubscr...@googlegroups.com>.
> > For more options, visit https://groups.google.com/d/optout.