Hi Devdatta,

I came across the legacy Devanagari fonts issue earlier, when I started
working on budget data. Shree-Dev, Kruti-Dev, Shivaji, etc. are legacy
fonts from an era before Unicode Devanagari was available. To get around
that, they used a simple substitution, like a = क. I've put up a graphic
that shows this mapping for a few fonts:
http://i.imgur.com/ICUC6Wk.png
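
To give a rough idea of what such a substitution amounts to, here is a minimal Python sketch. The table is a made-up toy one, NOT the real mapping of Kruti-Dev, Shree-Dev or any other font, and the function name is only for illustration. Real converters need a much larger table, and as far as I understand they also have to reorder some vowel signs that these fonts store in visual order.

```python
# Toy table for illustration only: NOT the real mapping of any legacy font.
TOY_MAP = {
    "a": "क",
    "b": "म",
    "c": "ल",
    "A": "ा",  # vowel sign aa
}


def legacy_to_unicode(text, table=TOY_MAP):
    """Replace each legacy code in `text` with its Unicode Devanagari value.

    Longer codes are tried first so that a multi-character code wins over
    a single-character one. Anything not in the table passes through as-is.
    """
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for k in keys:
            if text.startswith(k, i):
                out.append(table[k])
                i += len(k)
                break
        else:
            # No code matched at this position: keep the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(legacy_to_unicode("abc"))  # → कमल
```

Adding the "extra conversions" I mention below would then just mean adding more entries to the table.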

I found a group named technical-hindi that has been working on simple
JavaScript pages that convert these fonts to Unicode Devanagari (and back!).
I used them. For the content I had, I needed to add some extra
conversions, and then it worked like a charm.

Their site, where many converters are shared:
https://sites.google.com/site/technicalhindi/home/converters
Their Google group: https://groups.google.com/forum/#!forum/technical-hindi

I've shared the modified converters I used here:
http://ourpuneourbudget.in/tools/
(they only cover the limited use cases I had)

While studying these, I came upon an unexpected situation: if the
document you are extracting data from is a PDF (which I also refer to as
a "digital graveyard"), then it is PREFERABLE for the text to be in a
legacy Devanagari font rather than a Unicode font!

That's because as of today (or at least as of 2015, when I came across
it), PDF technology doesn't handle Unicode Devanagari well. To make the
glyphs "print" properly, the text gets altered in ways that permanently
distort the original characters. The issue is described here:
https://stackoverflow.com/questions/30756193/unable-to-copy-exact-hindi-content-from-pdf

So if the text in the PDFs you're working on is in legacy Devanagari
instead of Unicode Devanagari, then you're actually lucky :P .

If it's in Unicode then that PDF is a true digital graveyard :P. OCR can
work, yes, but please tell me if you find a way to OCR a page table cell by
table cell, separately, instead of jumbling everything up. I had also come
across a project like yours a year ago, but I backed out because I could
not get around this issue: the fonts in the PDF were in Unicode.

Here's an issue I filed in the Tabula project related to this. They fixed
it, at least for extraction from legacy fonts:
https://github.com/tabulapdf/tabula/issues/303



--
Cheers,
Nikhil VJ
+91-966-583-1250
Pune / Mandangad, India
DataMeet Pune chapter <https://datameet-pune.github.io/>
Self-designed learner at Swaraj University <http://www.swarajuniversity.org>
Blog <http://nikhilsheth.blogspot.in>

On Sat, Aug 19, 2017 at 11:21 PM, Raphael Susewind <
li...@raphael-susewind.de> wrote:

> Hi Devdatta,
>
> I had run into the same issue, and indeed the only workaround is OCR.
> It's not just a different encoding than Unicode: it's actually garbled
> CMaps, which is much worse (i.e. not recoverable).
>
> See my comments here for starters (and the badly written scripts):
>
> https://github.com/raphael-susewind/india-religion-politics/tree/master/
> maharolls2014
>
> As for Soundex, you might want to take a look at the IndicSoundex
> collection, which is more accurate than transliteration into latin
> followed by English soundex:
>
> http://libindic.org/Soundex
>
> Good news is that I have done the whole exercise for Maharashtra 2014,
> and may be able to share depending on what your project is about.
> Perhaps send me a PM and we can discuss further,
>
> Best,
> Raphael
>
> On 08/19/2017 06:14 PM, Devdatta Tengshe wrote:
> > I'm attempting to read Names, Ages & Genders from Electoral Rolls, so
> > that I can create a database of Names, to figure out the General Spread
> > of Specific Names across locations, and ages.
> >
> > I began working with Mumbai's rolls, and am running into the following
> > issues:
> >
> > 1) The Electoral Rolls are not in English, but in Devanagari. This is
> > not a major issue, because I could transliterate it into English for
> > comparison (I need the names to be in English, so that I can use Soundex
> > to remove misspellings etc.). I know libraries for transliteration that
> > work with Devanagari (Hindi & Marathi). Is there anything similar for
> > other scripts such as Kannada, Tamil etc.?
> >
> > 2) While the Rolls are in Devanagari, the text is not actually in
> > Unicode. It is in some other font, and hence when I get the text out,
> > it's garbage. Since others have worked with the rolls before, is there a
> > better way to get the text out?
> >
> > 3) If it's not possible to get the text out, can we use OCR? What OCR
> > library is best at working with Indic scripts?
> >
> > If anyone has some experience to share on these issues, it will be much
> > appreciated.
> >
> > --
> > Datameet is a community of Data Science enthusiasts in India. Know more
> > about us by visiting http://datameet.org
> > ---
> > You received this message because you are subscribed to the Google
> > Groups "datameet" group.
> > To unsubscribe from this group and stop receiving emails from it, send
> > an email to datameet+unsubscr...@googlegroups.com
> > <mailto:datameet+unsubscr...@googlegroups.com>.
> > For more options, visit https://groups.google.com/d/optout.
>
