I tried pdfminer, but it didn't work well enough. The output is the same as what I get from xpdf.
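
For anyone who wants to check their own files, here is a minimal sketch of one way to test whether extracted text is real Devanagari Unicode or just legacy-font glyph codes. It is not part of pdfminer or xpdf. The function name `devanagari_ratio` is made up for illustration, and the only assumption is that correct Hindi output falls in the Unicode Devanagari block (U+0900 to U+097F).

```python
def devanagari_ratio(text):
    """Return the fraction of non-whitespace characters that lie in the
    Unicode Devanagari block (U+0900 to U+097F).

    Hypothetical helper for illustration only. A value near 1.0 suggests
    the text really is Hindi in Unicode; a value near 0.0 suggests the
    extractor returned raw glyph codes from a legacy font.
    """
    # Ignore spaces and newlines so layout does not skew the ratio.
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return hits / len(chars)
```

Running this on the text that pdftotext or pdfminer produces for each of the PDFs would show quickly which files, if any, come out as usable Unicode.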

On Mon, May 24, 2010 at 7:51 PM, Dhananjay Nene <dhananjay.n...@gmail.com> wrote:

> You may want to try out pdfminer. It's very similar to xpdf in structure and
> should give you the parsed data in Unicode directly.
>
> On Mon, May 24, 2010 at 7:13 PM, Eknath Venkataramani <eknath.i...@gmail.com> wrote:
>
> > I have around 45 PDFs to convert into raw text, containing text in _HINDI_.
> > When I use the xpdf package, the generated text is very weird, so I'd like
> > to write a program which would convert the PDF text into Unicode text as it is.
> >
> > The fonts used in the PDFs:
> >
> > name                                 type              emb sub uni object ID
> > ------------------------------------ ----------------- --- --- --- ---------
> > APKAPP+Usha-Bold                     Type 1C           yes yes yes     72  0
> > APKBBB+Agenda-Light                  Type 1C           yes yes yes     77  0
> > APKBGF+Usha                          Type 1C           yes yes yes     41  0
> > APKBKJ+Agenda-Medium                 Type 1C           yes yes yes     46  0
> > APKBON+Agenda-Bold                   Type 1C           yes yes yes     49  0
> >
> > For example, in the PDF: आदमी मुसाफिर है
> > When I use pdftotext, I get some very weird symbols: '... .......'
> > while I'd like 'आदमी मुसाफिर है' to be the output.
> >
> > --
> > Eknath Venkataramani
>
> --
> blog: http://blog.dhananjaynene.com
> twitter: http://twitter.com/dnene

--
Eknath Venkataramani
_______________________________________________
BangPypers mailing list
BangPypers@python.org
http://mail.python.org/mailman/listinfo/bangpypers