Hi Raphael,

Thanks for sharing about Tesseract: it always helps to know what's in the
engines ~:)

I wish we had a way of OCR'ing tabular documents: something like Tabula's
interface combined with OCR.
I created a feature request on Tabula for this:
https://github.com/tabulapdf/tabula/issues/409
Let's hope it gets some love! Please +1 it!

Siddharth, you should share at least a one-page PDF sample of what you're
working with, so we can see which approach is best for what you've got.

If one goes the OCR way, we might need to convert the target PDF to an image
format first. There are quite a few online sites for doing that, but it gets
tricky with non-English scripts. If you're on Linux, then
*pdftoppm* is a good command line tool to use.

Sample command: pdftoppm -rx 200 -ry 200 -png b.pdf b
(200 sets the DPI for both axes; I found this worked best with the docs I
was doing)
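
To chain the image conversion with Tesseract, something like the rough
Python sketch below might work. It is only an illustration, not a tested
pipeline: the function names (pdftoppm_cmd, tesseract_cmd, ocr_pdf) are made
up for this example, and the "hin" language code assumes the Hindi trained
data for Tesseract is installed (swap in "mar" or whatever your script needs).

```python
import subprocess
from pathlib import Path


def pdftoppm_cmd(pdf, prefix, dpi=200):
    """Build the pdftoppm call from the sample command above."""
    return ["pdftoppm", "-rx", str(dpi), "-ry", str(dpi), "-png", str(pdf), str(prefix)]


def tesseract_cmd(image, lang="hin"):
    """Build a tesseract call; text is written to <image stem>.txt."""
    image = Path(image)
    # tesseract takes an output base name and appends .txt itself
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang]


def ocr_pdf(pdf, prefix, dpi=200, lang="hin"):
    """Rasterise every page of the PDF, then OCR each PNG in page order."""
    subprocess.run(pdftoppm_cmd(pdf, prefix, dpi), check=True)
    prefix = Path(prefix)
    # pdftoppm names its output <prefix>-<page>.png with zero-padded page numbers
    pages = sorted(prefix.parent.glob(prefix.name + "-*.png"))
    for page in pages:
        subprocess.run(tesseract_cmd(page, lang), check=True)
    return [page.with_suffix(".txt") for page in pages]
```

Calling ocr_pdf("b.pdf", "b") should leave one .txt file per page next to the
PNGs, provided both pdftoppm and tesseract are on the PATH.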


--
Cheers,
Nikhil
+91-966-583-1250
Pune, India
Self-designed learner at Swaraj University <http://www.swarajuniversity.org>
http://nikhilsheth.blogspot.in





On Mon, Sep 21, 2015 at 6:35 PM, Raphael Susewind <[email protected]> wrote:

> Hi Siddharth and Nikhil,
>
> sorry for the delay, I was travelling for the past weeks. I have worked
> extensively with the electoral rolls, and ultimately the only solution I
> found for the problem of corrupted text is OCR - tesseract was the most
> accurate in my experiments (and relatively the fastest...). It can also
> be automated, though scaling up would require vast resources.
>
> Let us know if you find an alternative (though I am sceptical),
>
> Best,
> Raphael
>
> On 19.09.2015 11:51, Nikhil VJ wrote:
> > Hi Siddharth,
> >
> > Sorry I missed this earlier.
> > In April this year I converted a budget PDF to Excel that had Marathi
> > content in a legacy font (similar to ShreeDev). It was a two-step process:
> > first extract to Excel, and then replace all the text after passing it
> > through a legacy-font-to-Unicode converter (an HTML file with JavaScript).
> >
> >
> > http://nikhilsheth.blogspot.in/2015/05/diy-pdf-to-excel-spreadsheet-conversion.html
> >
> > Just check your document or send me a copy. If it has legacy fonts, then
> > copy-pasting from it gives us random English letters and punctuation.
> > If it's Unicode, then copy-pasting gives us Unicode text only, but
> > inaccurate. It's possible that someone might have made a converter for
> > this; if not, and you have enough content, you could make your
> > own converter.
> >
> > If the PDF has a Unicode font in it, then my method fails.
> >
> > I wasn't aware of the stackoverflow questions you've linked to. Great
> > insights here into why Unicode extraction is failing.
> >
> > If it's only a few pages, then this free online multi-language OCR tool
> > might help: http://www.i2ocr.com/free-online-hindi-ocr
> > (it's a time-consuming page-by-page process, so only advisable if there
> > is little content or if you have a slave army of interns at your
> > disposal :P)
> >
> >
> >
> >
> > --
> > Cheers,
> > Nikhil
> > +91-966-583-1250
> > Pune, India
> > Self-designed learner at Swaraj University <http://www.swarajuniversity.org>
> > http://nikhilsheth.blogspot.in
> >
> >
> >
> >
> >
> > On Tue, Sep 1, 2015 at 7:37 PM, Siddharth Vijayakrishnan
> > <[email protected]> wrote:
> >
> >     Hi,
> >
> >     I downloaded a few files containing voter rolls and tried to parse
> >     the PDFs using pdfminer. Ran straight into a problem[1] where the
> >     glyphs are converted to Unicode using the wrong character map. Before
> >     I try to solve this on my own, I wonder if anyone in this community
> >     has a ready-made solution?
> >
> >     [1] http://stackoverflow.com/questions/31876415/parsing-a-pdfdevanagari-script-using-pdfminer-gives-incorrect-output
> >
> >     --
> >     Datameet is a community of Data Science enthusiasts in India. Know
> >     more about us by visiting http://datameet.org
> >     ---
> >     You received this message because you are subscribed to the Google
> >     Groups "datameet" group.
> >     To unsubscribe from this group and stop receiving emails from it,
> >     send an email to [email protected]
> >     <mailto:datameet%[email protected]>.
> >     For more options, visit https://groups.google.com/d/optout.
> >
> >
>
> --
> Dr. Raphael Susewind | Political anthropologist, Associate CSASP Oxford
>           Snail Mail | Melanchthonstr. 4a, 33615 Bielefeld, Germany
>        Web & Twitter | http://www.raphael-susewind.de | @RaphaelSusewind
>
> Please do consider http://www.gnupg.org for encryption (key id 10AEE42F)
>
>

