Hi Siddharth and Nikhil,

Sorry for the delay, I was travelling for the past few weeks. I have worked extensively with the electoral rolls, and ultimately the only solution I found for the problem of corrupted text is OCR. Tesseract was the most accurate in my experiments (and relatively the fastest...). It can also be automated, though scaling up would require vast resources.
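A minimal sketch of how such an automated run could look, not necessarily the pipeline used for the electoral rolls. It assumes the `pdftoppm` (poppler-utils) and `tesseract` command-line tools are installed, together with the Hindi language data (`hin`; use e.g. `mar` for Marathi). The function names and the `ocr_pages` working directory are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def rasterize_command(pdf_path, out_prefix, dpi=300):
    """Build the pdftoppm call that renders each PDF page to a PNG file."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def ocr_command(image_path, lang="hin"):
    """Build the tesseract call; 'stdout' makes it print the recognised text."""
    # Assumption: the traineddata file for `lang` is installed.
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_pdf(pdf_path, workdir, lang="hin"):
    """Rasterize every page of pdf_path, OCR it, and return one text per page."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_command(pdf_path, workdir / "page"), check=True)

    texts = []
    # pdftoppm writes page-1.png, page-2.png, ... (zero-padded when needed),
    # so a plain sort keeps the pages in order.
    for image in sorted(workdir.glob("page-*.png")):
        result = subprocess.run(
            ocr_command(image, lang),
            check=True,
            capture_output=True,
            encoding="utf-8",
        )
        texts.append(result.stdout)
    return texts


if __name__ == "__main__":
    # Usage: python ocr_rolls.py some_roll.pdf
    if len(sys.argv) > 1:
        for number, text in enumerate(ocr_pdf(sys.argv[1], "ocr_pages"), start=1):
            print(f"--- page {number} ---")
            print(text)
```

Each page is rasterized and recognised separately, so the cost grows linearly with the page count. That is where the resource problem comes from once this is applied to complete rolls.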
Let us know if you find an alternative (though I am sceptical),

Best,
Raphael

On 19.09.2015 11:51, Nikhil VJ wrote:
> Hi Siddharth,
>
> Sorry I missed this earlier.
> In April this year I converted a budget PDF to excel that had Marathi
> content, in legacy font (similar to ShreeDev). It was two-step: first
> extract to excel, and then replace all the text after passing through a
> legacy font to unicode converter (an HTML file with javascript)
>
> http://nikhilsheth.blogspot.in/2015/05/diy-pdf-to-excel-spreadsheet-conversion.html
>
> Just check your document or send me a copy.. if it has legacy fonts then
> copy-pasting from it gives us random english letters and punctuation.
> If it's unicode, then copy-pasting gives us unicode text only, but
> inaccurate. It's possible that someone might have made a converter for
> this; if not, then if you have enough content then you could make your
> own converter.
>
> If the PDF has Unicode font in it, then my method fails.
>
> I wasn't aware of the stackoverflow questions you've linked to. Great
> insights here into why Unicode extraction is failing.
>
> If it's less pages then this free online multi-language OCR tool might
> help: http://www.i2ocr.com/free-online-hindi-ocr
> (per page time-taking process, so only advisable if content is less or
> if you have a slave army of interns at your disposal :P)
>
> --
> Cheers,
> Nikhil
> +91-966-583-1250
> Pune, India
> Self-designed learner at Swaraj University <http://www.swarajuniversity.org>
> http://nikhilsheth.blogspot.in
>
> On Tue, Sep 1, 2015 at 7:37 PM, Siddharth Vijayakrishnan
> <[email protected] <mailto:[email protected]>> wrote:
>
> Hi,
>
> I downloaded a few files containing voter rolls and tried to parse
> the PDFs using pdfminer. Ran straight into a problem[1] where the
> glyphs are converted to unicode using a wrong character map. Before
> I try and solve this on my own, I wonder if anyone in this community
> has a readymade solution?
>
> [1]
> http://stackoverflow.com/questions/31876415/parsing-a-pdfdevanagari-script-using-pdfminer-gives-incorrect-output
>
> --
> Datameet is a community of Data Science enthusiasts in India. Know
> more about us by visiting http://datameet.org
> ---
> You received this message because you are subscribed to the Google
> Groups "datameet" group.
> To unsubscribe from this group and stop receiving emails from it,
> send an email to [email protected]
> <mailto:datameet%[email protected]>.
> For more options, visit https://groups.google.com/d/optout.

--
Dr. Raphael Susewind | Political anthropologist, Associate CSASP Oxford
Snail Mail | Melanchthonstr. 4a, 33615 Bielefeld, Germany
Web & Twitter | http://www.raphael-susewind.de | @RaphaelSusewind
Please do consider http://www.gnupg.org for encryption (key id 10AEE42F)
