Re: [magick-users] Extract PDF Text

Joseph Kolibal Sat, 24 Nov 2007 06:44:28 -0800

There are several alternatives on Linux; there may be others that I am
not aware of. I use pdftotext, i.e.,

  pdftotext FILE.pdf

to extract the text to FILE.txt. If that fails, I convert the PDF to
a PostScript file using

 pdf2ps FILE.pdf

creating FILE.ps and then I try

 ps2ascii FILE.ps

or

 pstotext FILE.ps
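
The fallback order above could be wrapped in a small shell function. This is
only a sketch: the name extract_pdf_text is made up for illustration, the
explicit output file names are my choice, and the "non-empty output" check is
an assumption about what counts as success (an image-only PDF may still
produce a file containing nothing but form feeds).

```shell
# Sketch: try pdftotext first, then fall back to pdf2ps + ps2ascii.
# extract_pdf_text is a hypothetical helper name, not part of any package.
extract_pdf_text() {
    pdf=$1
    base=${pdf%.pdf}

    # First attempt: pdftotext, writing to FILE.txt next to FILE.pdf.
    # Assumption: an empty FILE.txt is treated as a failed extraction.
    if pdftotext "$pdf" "$base.txt" && [ -s "$base.txt" ]; then
        echo "pdftotext"
        return 0
    fi

    # Fallback: convert to PostScript, then pull the text out of that.
    if pdf2ps "$pdf" "$base.ps" \
        && ps2ascii "$base.ps" > "$base.txt" \
        && [ -s "$base.txt" ]; then
        echo "pdf2ps+ps2ascii"
        return 0
    fi

    echo "no text extracted from $pdf" >&2
    return 1
}
```

Calling extract_pdf_text FILE.pdf prints which route produced FILE.txt, or
returns a non-zero status if neither did. pstotext could be substituted for
ps2ascii in the fallback step.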

I have also done the PDF-to-PostScript conversion with Adobe's acroread,
i.e.,

 cat FILE.pdf|acroread -toPostScript -start PAGESTART -end PAGEEND > NEWFILE.ps

and obtained different results. In some cases I have also found it convenient
to use the pdftk package, using

    pdftk FILE.pdf output NEWFILE.pdf uncompress

which writes an uncompressed PDF that can be opened in a text editor to
make corrections directly. pdftk can also repair a corrupted PDF, and in
some cases it may be necessary to try this.
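
As a sketch of that edit cycle: uncompress, edit by hand, then pass the result
back through pdftk. The function names here are made up for illustration; the
pdftk invocations use its documented "uncompress" and "compress" output
options. Whether pdftk can recover a file after a given hand edit depends on
the edit, so treat this as something to try rather than a guarantee.

```shell
# Hypothetical helper names; only the pdftk command lines are real.
pdf_uncompress() {
    # Write an uncompressed copy whose page streams are readable in an editor.
    pdftk "$1" output "$2" uncompress
}

pdf_recompress() {
    # After hand-editing, pass the file back through pdftk to recompress it.
    pdftk "$1" output "$2" compress
}
```

For example: pdf_uncompress FILE.pdf NEWFILE.pdf, edit NEWFILE.pdf, then
pdf_recompress NEWFILE.pdf FIXED.pdf.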

Finally, it is sometimes convenient to grab the text from the page display
of the PDF file using the mouse along with kpdf or xpdf. Needless to say,
extracting non-text data such as tables and mathematics does not always
succeed as well as desired.

                                    Joseph 

On Fri, 23 Nov 2007 18:27:10 -0500
Ben Marchbanks <[EMAIL PROTECTED]> wrote:

> Is there a way to dump the text from a PDF independent of  the PDF to
> image conversion process ?
> 

----------------------------------
Joseph Kolibal
The University of Southern Mississippi
Department of Mathematics
118 College Drive 5045
Hattiesburg, MS 39406-0001

E-mail: [EMAIL PROTECTED], [EMAIL PROTECTED]
Office: Room 207 Southern Hall, PH: 601-266-4301, FX: 601-266-5818

Web Links:
http://www.math.usm.edu/kolibal (Home pages)
http://www.math.usm.edu/cmi     CMI (Computational Mathematics Information)

Further contact:
 Department of Mathematics
 PH: 601-266-4289/FX: 601-266-5818
 http://www.usm.edu/math

Sent: From athena
----------------------------------
_______________________________________________
Magick-users mailing list
[email protected]
http://studio.imagemagick.org/mailman/listinfo/magick-users
