There are several alternatives under Linux; there may be others that
I am not aware of. I use pdftotext, i.e.,
pdftotext FILE.pdf
to extract the text to FILE.txt. If that fails, I convert the PDF to
a PostScript file using
pdf2ps FILE.pdf
creating FILE.ps and then I try
ps2ascii FILE.ps
or
pstotext FILE.ps
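The fallback order described above could be scripted roughly as follows. This is only a sketch: the function name extract_text is made up for illustration, and treating an empty output file as a failure is an assumption of the sketch, not something the tools themselves report.

```shell
#!/bin/sh
# Sketch: try pdftotext first, then fall back to pdf2ps + ps2ascii.
# extract_text is a hypothetical helper name, not part of any package.
extract_text() {
    pdf=$1
    base=${pdf%.pdf}

    # Refuse to continue if the input file does not exist.
    [ -f "$pdf" ] || { echo "no such file: $pdf" >&2; return 2; }

    # First choice: pdftotext. Accept the result only if it is non-empty
    # (assumption: an empty FILE.txt means the extraction failed).
    if command -v pdftotext >/dev/null 2>&1 &&
       pdftotext "$pdf" "$base.txt" && [ -s "$base.txt" ]; then
        return 0
    fi

    # Fallback: convert to PostScript, then extract with ps2ascii.
    if command -v pdf2ps >/dev/null 2>&1 &&
       command -v ps2ascii >/dev/null 2>&1 &&
       pdf2ps "$pdf" "$base.ps" &&
       ps2ascii "$base.ps" > "$base.txt" && [ -s "$base.txt" ]; then
        return 0
    fi

    echo "could not extract text from $pdf" >&2
    return 1
}
```

Calling extract_text FILE.pdf leaves the text in FILE.txt on success and returns a non-zero status otherwise.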
I have also done the conversion from PDF to PostScript using Adobe's
acroread, i.e.,
cat FILE.pdf|acroread -toPostScript -start PAGESTART -end PAGEEND > NEWFILE.ps
and obtained different results. In some cases I have also found the pdftk
package convenient, using
pdftk FILE.pdf output NEWFILE.pdf uncompress
in which the output is an uncompressed PDF which can be edited with a text
editor to make corrections directly. This package can also repair a corrupted
PDF, and in some cases it may be necessary to try this.
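A repair attempt might look like the sketch below. pdftk rewrites a file when given only an input and an output, which is the form its documentation suggests for repairing a damaged PDF where possible. The wrapper name repair_pdf and the return codes are hypothetical and chosen only for illustration.

```shell
#!/bin/sh
# Sketch: ask pdftk to rewrite a possibly damaged PDF.
# repair_pdf is a hypothetical helper name, not a pdftk command.
repair_pdf() {
    in=$1
    out=$2

    # Both an input and an output file name are required.
    [ -n "$in" ] && [ -n "$out" ] || { echo "usage: repair_pdf IN.pdf OUT.pdf" >&2; return 64; }

    # The input must exist before pdftk is invoked.
    [ -f "$in" ] || { echo "no such file: $in" >&2; return 2; }

    # pdftk must be installed.
    command -v pdftk >/dev/null 2>&1 || { echo "pdftk not found" >&2; return 3; }

    # Rewriting the file is the repair step; it may still fail on badly broken input.
    pdftk "$in" output "$out"
}
```

If repair_pdf FILE.pdf NEWFILE.pdf succeeds, pdftotext can then be tried again on NEWFILE.pdf.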
Finally, it is sometimes convenient to grab the text from the page display
of the PDF file using the mouse along with kpdf or xpdf. Needless to say,
extracting non-text data such as tables and mathematics does not always
succeed as well as desired.
Joseph
On Fri, 23 Nov 2007 18:27:10 -0500
Ben Marchbanks <[EMAIL PROTECTED]> wrote:
> Is there a way to dump the text from a PDF independent of the PDF to
> image conversion process ?
>
----------------------------------
Joseph Kolibal
The University of Southern Mississippi
Department of Mathematics
118 College Drive 5045
Hattiesburg, MS 39406-0001
E-mail: [EMAIL PROTECTED], [EMAIL PROTECTED]
Office: Room 207 Southern Hall, PH: 601-266-4301, FX: 601-266-5818
Web Links:
http://www.math.usm.edu/kolibal (Home pages)
http://www.math.usm.edu/cmi CMI (Computational Mathematics Information)
Further contact:
Department of Mathematics
PH: 601-266-4289/FX: 601-266-5818
http://www.usm.edu/math
Sent: From athena
----------------------------------