Just a few things to add to that: pdftohtml has an XML output mode which is pretty good at reassembling paragraphs.
http://pdftohtml.sourceforge.net
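For what it's worth, here is a rough sketch of how the XML mode might be wrapped in a small shell function. The name pdf_to_xml is made up for illustration and is not part of pdftohtml. It relies only on the -xml and -i (ignore images) options, and I am assuming the output ends up in a file named after the base name plus ".xml", which may differ between builds.

```shell
#!/bin/sh
# pdf_to_xml: hypothetical helper, not something shipped with pdftohtml.
# Usage: pdf_to_xml FILE.pdf [OUTPUT_BASENAME]
pdf_to_xml() {
    src=$1
    # Default output base name: the input name without its .pdf suffix.
    base=${2:-${src%.pdf}}

    # Fail early if the input cannot be read.
    if [ ! -r "$src" ]; then
        echo "pdf_to_xml: cannot read $src" >&2
        return 1
    fi

    # Fail with a distinct status if pdftohtml is not installed.
    if ! command -v pdftohtml >/dev/null 2>&1; then
        echo "pdf_to_xml: pdftohtml not found in PATH" >&2
        return 2
    fi

    # -xml selects XML output, -i skips image extraction.
    # Assumption: the result is written to "$base.xml".
    pdftohtml -xml -i "$src" "$base"
}
```

After sourcing this, something like "pdf_to_xml FILE.pdf" should leave the reassembled text in FILE.xml, assuming your pdftohtml build names its output that way.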
pdftohtml, although based on Foo Labs' xpdf, is a separate distribution from it. pdftotext is part of xpdf, though.
http://www.foolabs.com/xpdf/

ps2ascii is part of the Ghostscript distribution, and doesn't care whether the input is PS or PDF.
http://www.ghostscript.com

pstotext is yet another package, also dependent on Ghostscript but separate from it. Again, since it depends on Ghostscript it should work directly with PDF and not just PS.
http://pages.cs.wisc.edu/~ghost/doc/pstotext.htm

pdftool from the MuPDF library (now abandoned) can output a full parse tree of (some) PDFs. (It tends to choke on more recent versions, unfortunately.) The parse tree can be output as XML, and the text appears in that; unfortunately the text is often split up into single characters and would require XML postprocessing to be useful.
http://ccxvii.net/apparition/

On 11/24/07, Joseph Kolibal <[EMAIL PROTECTED]> wrote:
> There are several alternatives within linux. If there are other options,
> I am not aware of them. I use pdftotext, i.e.,
>
>     pdftotext FILE.pdf
>
> to extract to FILE.txt. Alternatively, if that fails I convert the pdf to
> a postscript file using
>
>     pdf2ps FILE.pdf
>
> creating FILE.ps and then I try
>
>     ps2ascii FILE.ps
>
> or
>
>     pstotext FILE.ps
>
> I have also done the conversion from pdf to postscript using adobe
> acrobat's acroread, i.e., do
>
>     cat FILE.pdf|acroread -toPostScript -start PAGESTART -end PAGEEND > NEWFILE.ps
>
> and obtained different results. In some cases I have also found it
> convenient to use the pdftk package, using
>
>     pdftk FILE.pdf output NEWFILE.pdf uncompress
>
> in which the output is an uncompressed pdf which can be manipulated with a
> text editor to make corrections directly. This package can repair a
> corrupted pdf, and it may be necessary in some cases to try this.
>
> Finally, it is sometimes convenient to grab the text from the page display
> of the pdf file using the mouse along with kpdf or xpdf.
> Needless to say, extracting non-text data such as tables and mathematics
> does not always succeed as well as desired.
>
> Joseph
>
> On Fri, 23 Nov 2007 18:27:10 -0500
> Ben Marchbanks <[EMAIL PROTECTED]> wrote:
>
> > Is there a way to dump the text from a PDF independent of the PDF to
> > image conversion process ?
>
> ----------------------------------
> Joseph Kolibal
> The University of Southern Mississippi
> Department of Mathematics
> 118 College Drive 5045
> Hattiesburg, MS 39406-0001
>
> E-mail: [EMAIL PROTECTED], [EMAIL PROTECTED]
> Office: Room 207 Southern Hall, PH: 601-266-4301, FX: 601-266-5818
>
> Web Links:
> http://www.math.usm.edu/kolibal (Home pages)
> http://www.math.usm.edu/cmi CMI (Computational Mathematics Information)
>
> Further contact:
> Department of Mathematics
> PH: 601-266-4289/FX: 601-266-5818
> http://www.usm.edu/math
>
> Sent: From athena
> ----------------------------------

_______________________________________________
Magick-users mailing list
[email protected]
http://studio.imagemagick.org/mailman/listinfo/magick-users
