Thank you all for your replies.

@Raphael: Thanks for confirming that OCR is the only way forward. I
suspected as much, but now the path is clear.

Thank you for your offer to share the data, but the objective of this
project is for me to get my hands dirty and learn something new, so I
pretty much have to struggle and extract the data myself.

@Nikhil:
As Raphael mentioned, the data is stored as Unicode, but with garbled
CMaps. Hence my name in the list, which appears as:

[inline image not preserved]

becomes 'तरगशदददवदत शरकपषण' when you extract it as text. They are
displaying the data in some 'CDAC_GISTSurekh' font. This problem has
been faced before and is explained in this answer:
https://stackoverflow.com/questions/15385270/read-pdf-using-itextsharp-where-pdf-language-is-non-english/15566820#15566820
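
As a rough illustration of the symptom (not something taken from the linked
answer): the extracted string above appears to contain only base consonants,
with no vowel signs or viramas. A minimal sketch, using only Python's standard
unicodedata module, can flag extracted text like this. The function names and
the 0.1 threshold are hypothetical and chosen only for illustration.

```python
import unicodedata

DEVANAGARI_START = "\u0900"
DEVANAGARI_END = "\u097F"


def devanagari_mark_ratio(text: str) -> float:
    """Return the share of Devanagari characters that are combining marks.

    Vowel signs (matras) and the virama have Unicode category Mn or Mc.
    Normal Devanagari text contains plenty of them; text extracted through
    a broken CMap often contains few or none.
    """
    deva = [ch for ch in text if DEVANAGARI_START <= ch <= DEVANAGARI_END]
    if not deva:
        return 0.0
    marks = [ch for ch in deva if unicodedata.category(ch) in ("Mn", "Mc")]
    return len(marks) / len(deva)


def looks_garbled(text: str, threshold: float = 0.1) -> bool:
    """Heuristic: Devanagari text with almost no combining marks is suspect.

    The threshold is an arbitrary illustrative cut-off, not a tuned value.
    """
    has_devanagari = any(DEVANAGARI_START <= ch <= DEVANAGARI_END for ch in text)
    return has_devanagari and devanagari_mark_ratio(text) < threshold


if __name__ == "__main__":
    extracted = "तरगशदददवदत शरकपषण"  # the string quoted in this message
    print(devanagari_mark_ratio(extracted))
    print(looks_garbled(extracted))
```

A check like this only detects the problem; it cannot recover the missing
vowels, which is why OCR remains the way forward here.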

The issue you mentioned definitely needs solving, but it is a
separate issue from the one at hand.

I'll keep the group informed about any progress I make.


Regards,
Devdatta

On Sun, Aug 20, 2017 at 9:10 PM, nishadh <nishad...@gmail.com> wrote:

> Oops, the vowel issues are the same as the ones Nikhil pointed out in
> https://github.com/tabulapdf/tabula/issues/303.
> Sorry, I have very limited knowledge of Devanagari fonts.
>
>
> On Sunday, August 20, 2017 at 8:55:01 PM UTC+5:30, nishadh wrote:
>>
>> Hi,
>>
>> There is a Python-based wrapper for Tabula:
>> https://github.com/chezou/tabula-py. It converts PDF tables into pandas
>> DataFrames. I tried it with a sample electoral roll PDF from
>> https://ceo.maharashtra.gov.in/Search/SearchPDF.aspx and it did convert a
>> single-page table into a pandas DataFrame. The DataFrame output has to be
>> written with 'utf-8' encoding to convert it into CSV. In the Jupyter
>> notebook and the CSV file, the Devanagari text looked similar to the PDF;
>> however, I found that vowels are missing in the printed output. A closer
>> look could sort this out. Maybe preprocessing the PDF by splitting it into
>> single pages (this is mandatory, and takes a few seconds even for a single
>> page), or cropping it to a single electoral entry table, could fetch better
>> results; the pyPdf library is a good help for that.
>>
>>
>>
> --
> Datameet is a community of Data Science enthusiasts in India. Know more
> about us by visiting http://datameet.org
> ---
> You received this message because you are subscribed to the Google Groups
> "datameet" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to datameet+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>
