Re: [BangPypers] extracting unicode text from pdfs

Eknath Venkataramani Mon, 24 May 2010 08:16:33 -0700

I tried pdfminer, but it didn't work out well enough. The output is the same
as what I get out of xpdf.
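
One way to confirm that two extractors are failing in the same way is to measure how much of the extracted text falls in the Unicode Devanagari block (U+0900 to U+097F). The sketch below is only illustrative: the function names and the 0.5 threshold are assumptions, not part of xpdf or pdfminer, and the "garbled" sample is a made-up placeholder rather than real tool output. If the ratio is near zero for a Hindi document, the fonts are probably mapped to non-Devanagari code points, and any extractor that trusts that mapping will give the same result.

```python
# Devanagari block in Unicode: U+0900 through U+097F.
DEVANAGARI_START = 0x0900
DEVANAGARI_END = 0x097F


def devanagari_ratio(text):
    """Return the fraction of non-whitespace characters that are Devanagari."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for ch in chars
               if DEVANAGARI_START <= ord(ch) <= DEVANAGARI_END)
    return hits / len(chars)


def looks_like_hindi(text, threshold=0.5):
    """Heuristic check; the threshold is an arbitrary assumption."""
    return devanagari_ratio(text) >= threshold


if __name__ == "__main__":
    expected = "आदमी मुसाफिर है"      # the sample line from the PDF
    garbled = "abc defghij kl"        # hypothetical stand-in for bad output
    print(looks_like_hindi(expected))
    print(looks_like_hindi(garbled))
```

Running this over the text produced by each tool for all 45 files would show quickly whether any of them yields real Devanagari.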


On Mon, May 24, 2010 at 7:51 PM, Dhananjay Nene <dhananjay.n...@gmail.com> wrote:

> You may want to try out pdfminer. It's very similar to xpdf in structure and
> should give you the parsed data as Unicode directly.
>
> On Mon, May 24, 2010 at 7:13 PM, Eknath Venkataramani <eknath.i...@gmail.com> wrote:
>
> > I have around 45 PDFs containing text in _HINDI_ that I need to convert
> > into raw text. When I use the xpdf package, the generated text is garbled,
> > so I'd like to write a program that converts the PDF text into Unicode
> > text as it is.
> >
> > The fonts used in the PDFs:
> > name                                 type              emb sub uni object ID
> > ------------------------------------ ----------------- --- --- --- ---------
> > APKAPP+Usha-Bold                     Type 1C           yes yes yes     72  0
> > APKBBB+Agenda-Light                  Type 1C           yes yes yes     77  0
> > APKBGF+Usha                          Type 1C           yes yes yes     41  0
> > APKBKJ+Agenda-Medium                 Type 1C           yes yes yes     46  0
> > APKBON+Agenda-Bold                   Type 1C           yes yes yes     49  0
> >
> > For example, the PDF contains: आदमी मुसाफिर है
> > When I use pdftotext, I get some very weird symbols: '... .......'
> > whereas I'd like the output to be 'आदमी मुसाफिर है'.
> >
> >
> > --
> > Eknath Venkataramani
> > _______________________________________________
> > BangPypers mailing list
> > BangPypers@python.org
> > http://mail.python.org/mailman/listinfo/bangpypers
> >
>
>
>
> --
> --------------------------------------------------------
> blog: http://blog.dhananjaynene.com
> twitter: http://twitter.com/dnene
>



-- 
Eknath Venkataramani
