Thanks for sharing this, Vishal. The workaround used in the case described in
the doc was to extract the unique ID numbers, which were not in Devanagari,
and then use them to look up and download details from the Election
Commission website.
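
To make the first half of that workaround concrete, here is a rough Python
sketch of pulling ID-like tokens out of otherwise garbled text. It is only an
illustration, not the code from the linked repo. The ID format (three Latin
capital letters followed by seven digits), the function name and the sample
string are all assumptions of mine; the lookup on the website is not shown.

```python
import re

# Assumption: a voter ID is three Latin capital letters followed by seven
# ASCII digits, e.g. "ABC1234567". Adjust the pattern for other formats.
# ASCII-only lookarounds are used instead of \b, because garbled non-ASCII
# letters next to an ID would otherwise count as word characters.
ID_PATTERN = re.compile(r"(?<![A-Z0-9])[A-Z]{3}[0-9]{7}(?![0-9])")


def extract_voter_ids(text):
    """Return unique ID-like tokens in order of first appearance."""
    seen = set()
    ids = []
    for token in ID_PATTERN.findall(text):
        if token not in seen:
            seen.add(token)
            ids.append(token)
    return ids


# Hypothetical sample: legacy-font garbage with ASCII IDs mixed in.
sample = "xÉÉ¨É ABC1234567 ´ÉªÉ 45\n+ÉÉ XYZ7654321 ABC1234567"
print(extract_voter_ids(sample))  # ['ABC1234567', 'XYZ7654321']
```

The IDs survive text extraction because they are plain ASCII, so they can
serve as lookup keys even when the names around them are unreadable.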



--
Cheers,
Nikhil VJ
+91-966-583-1250
Pune / Mandangad, India
DataMeet Pune chapter <https://datameet-pune.github.io/>
Self-designed learner at Swaraj University <http://www.swarajuniversity.org>
Blog <http://nikhilsheth.blogspot.in>
Contribute <https://www.instamojo.com/@nikhilvj/>

On Thu, Sep 7, 2017 at 3:22 PM, Vishal Bhave <vishalbh...@gmail.com> wrote:

> Hi,
> I came across this project. The author has found a way around the encoding
> issue that affects Devanagari text after converting PDF to text/HTML.
>
> https://github.com/RO-29/electoral_scraper_pdf
> https://docs.google.com/document/d/1ZbY7KF4XQfJ7K3VbkcSaLW__sThIlnxcvLVN5nFKSUk/edit
>
> Best wishes,
> Vishal Bhave
>
>
> On Saturday, August 19, 2017 at 10:44:28 PM UTC+5:30, Devdatta Tengshe
> wrote:
>>
>> I'm attempting to read names, ages and genders from electoral rolls, so
>> that I can create a database of names and work out the general spread of
>> specific names across locations and ages.
>>
>> I began working with Mumbai's rolls, and am running into the following
>> issues:
>>
>> 1) The electoral rolls are not in English, but in Devanagari. This is not
>> a major issue, because I could transliterate the names into English for
>> comparison (I need the names in English so that I can use Soundex to
>> remove misspellings etc.). I know of transliteration libraries that work
>> with Devanagari (Hindi & Marathi). Is there anything similar for other
>> scripts such as Kannada and Tamil?
>>
>> 2) While the rolls are in Devanagari, the text is not actually in Unicode.
>> It uses some other font encoding, so when I extract the text, it's garbage.
>> Since others have worked with the rolls before, is there a better way to
>> get the text out?
>>
>> 3) If it's not possible to get the text out, can we use OCR? Which OCR
>> library is best at working with Indic scripts?
>>
>> If anyone has some experience to share on these issues, it will be much
>> appreciated.
>>
> --
> Datameet is a community of Data Science enthusiasts in India. Know more
> about us by visiting http://datameet.org
> ---
> You received this message because you are subscribed to the Google Groups
> "datameet" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to datameet+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>
