Re: [magick-users] Extract PDF Text

Ross Presser Sat, 24 Nov 2007 09:50:32 -0800

Just a few things to add to that:

pdftohtml has an xml output mode which is pretty good at reassembling
paragraphs.
http://pdftohtml.sourceforge.net
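
A minimal sketch of the xml mode, in case it helps. The function names pdf_to_xml and xml_text_lines are made up for illustration. The sed filter assumes that each <text ...> element sits on a single line of the output, which is how I recall pdftohtml writing it, so treat that as an assumption and check it against your own files.

```shell
# Hypothetical helper: write FILE.xml next to FILE.pdf.
# -xml selects XML output, -i skips image extraction.
pdf_to_xml() {
    pdftohtml -xml -i "$1" "${1%.pdf}"
}

# Hypothetical helper: print the contents of each <text ...>...</text> line.
# Inline tags such as <b> or <i> are stripped; entities (&amp; etc.) are left alone.
# Assumes one <text> element per line.
xml_text_lines() {
    sed -n 's/^[[:space:]]*<text[^>]*>\(.*\)<\/text>[[:space:]]*$/\1/p' "$1" |
        sed 's/<[^>]*>//g'
}
```

Usage would be something like: pdf_to_xml FILE.pdf && xml_text_lines FILE.xml > FILE.txt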


pdftohtml, although based on foo labs' xpdf, is a separate
distribution from it. pdftotext is part of xpdf though.
http://www.foolabs.com/xpdf/

ps2ascii is part of the ghostscript distribution, and doesn't care if
the input is ps or pdf.
http://www.ghostscript.com

pstotext is yet another package, also dependent on ghostscript but
separate from it. Again, since it depends on ghostscript it should
work directly with pdf and not just ps.
http://pages.cs.wisc.edu/~ghost/doc/pstotext.htm
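
Since ps2ascii and pstotext should both accept the pdf directly, the pdf2ps step in the quoted message below can probably be skipped. Here is a rough sketch of a fallback chain over the three extractors. The function name extract_pdf_text and the "first non-empty output wins" rule are my own choices for illustration, not something any of these packages prescribes.

```shell
# Hypothetical helper: try pdftotext, then ps2ascii, then pstotext, all on the
# pdf itself. Writes FILE.txt and prints the name of the tool that succeeded.
# A tool is skipped if it is not installed, fails, or produces an empty file.
extract_pdf_text() {
    pdf=$1
    out=${pdf%.pdf}.txt

    if command -v pdftotext >/dev/null 2>&1 &&
       pdftotext "$pdf" "$out" && [ -s "$out" ]; then
        echo pdftotext
        return 0
    fi

    if command -v ps2ascii >/dev/null 2>&1 &&
       ps2ascii "$pdf" "$out" && [ -s "$out" ]; then
        echo ps2ascii
        return 0
    fi

    # pstotext writes to standard output, so redirect it.
    if command -v pstotext >/dev/null 2>&1 &&
       pstotext "$pdf" > "$out" && [ -s "$out" ]; then
        echo pstotext
        return 0
    fi

    return 1
}
```

Called as: extract_pdf_text FILE.pdf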

pdftool from the MuPDF library (now abandoned) can output a full parse
tree of (some) pdfs. (It tends to choke on more recent versions,
unfortunately.) The parse tree can be output as xml, and the text
appears in that; unfortunately the text is often split up into single
characters and would require xml postprocessing to be useful.
http://ccxvii.net/apparition/


On 11/24/07, Joseph Kolibal <[EMAIL PROTECTED]> wrote:
> There are several alternatives within linux. If there are other options,
> I am not aware of them. I use pdftotext, i.e.,
>
>   pdftotext FILE.pdf
>
> to extract to FILE.txt. Alternatively, if that fails I convert the pdf to
> a postscript file using
>
>  pdf2ps FILE.pdf
>
> creating FILE.ps and then I try
>
>  ps2ascii FILE.ps
>
> or
>
>  pstotext FILE.ps
>
> I have also done the conversion from pdf to postscript using adobe acrobat's
> acroread, i.e., do
>
>  cat FILE.pdf|acroread -toPostScript -start PAGESTART -end PAGEEND > NEWFILE.ps
>
> and obtained different results. In some cases I have also found it
> convenient to use the pdftk package, using
>
>     pdftk FILE.pdf output NEWFILE.pdf uncompress
>
> in which the output is an uncompressed pdf which can be manipulated with a
> text editor to make corrections directly. This package can repair a
> corrupted pdf, and it may be necessary in some cases to try this.
>
> Finally, it is sometimes convenient to grab the text from the page display
> of the pdf file using the mouse along with kpdf or xpdf. Needless to say,
> extracting non-text data such as tables and mathematics does not always
> succeed as well as desired.
>
>
>                                     Joseph
>
>
>
> On Fri, 23 Nov 2007 18:27:10 -0500
> Ben Marchbanks <[EMAIL PROTECTED]> wrote:
>
> > Is there a way to dump the text from a PDF independent of  the PDF to
> > image conversion process ?
> >
>
>
>
> ----------------------------------
> Joseph Kolibal
> The University of Southern Mississippi
> Department of Mathematics
> 118 College Drive 5045
> Hattiesburg, MS 39406-0001
>
> E-mail: [EMAIL PROTECTED], [EMAIL PROTECTED]
> Office: Room 207 Southern Hall, PH: 601-266-4301, FX: 601-266-5818
>
> Web Links:
> http://www.math.usm.edu/kolibal (Home pages)
> http://www.math.usm.edu/cmi     CMI (Computational Mathematics Information)
>
> Further contact:
>  Department of Mathematics
>  PH: 601-266-4289/FX: 601-266-5818
>  http://www.usm.edu/math
>
> Sent: From athena
> ----------------------------------
> _______________________________________________
> Magick-users mailing list
> [email protected]
> http://studio.imagemagick.org/mailman/listinfo/magick-users
>
