You need to convert the file to PostScript before you can read anything out of
it, and even then, words are often broken up across several "show" operators.

Use pdf2ps (part of GNU Ghostscript) to convert the PDF to PostScript, then
search for patterns like this:
(text) show
That is the most basic PostScript syntax, but it is often more complex than
that. The Adobe PostScript 3 driver for Windows, for example, writes
all-in-one lines that carry positioning and formatting parameters along with
the text. So basically, you treat anything in parentheses as text.
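
As a rough illustration of the "anything in parentheses" approach, here is a
minimal sketch. It is written in Python only for brevity (the same regular
expressions should carry over to PHP's preg_match_all), and the names
extract_text and unescape are made up for this example. It assumes the
PostScript is uncompressed plain text and only handles the simple escapes
\( \) and \\, not octal escapes or nested unescaped parentheses.

```python
import re

# A PostScript string literal "(...)"; backslash escapes are allowed inside.
# Assumption: no nested, unescaped parentheses within the string.
ANY_STRING_RE = re.compile(r"\(((?:\\.|[^\\()])*)\)")

# The same literal, but only when it is directly followed by the show operator.
SHOW_RE = re.compile(r"\(((?:\\.|[^\\()])*)\)\s*show\b")


def unescape(s):
    """Turn the escapes \\( \\) and \\\\ back into literal characters."""
    return re.sub(r"\\([()\\])", r"\1", s)


def extract_text(ps_source, show_only=False):
    """Return the string fragments found in PostScript source, in order.

    show_only=True keeps only strings followed by 'show'; the default
    takes every parenthesized string, which is cruder but catches the
    all-in-one lines that some drivers emit.
    """
    pattern = SHOW_RE if show_only else ANY_STRING_RE
    return [unescape(m.group(1)) for m in pattern.finditer(ps_source)]


if __name__ == "__main__":
    sample = "72 720 moveto\n(Hello) show\n( wor) show (ld) show\n"
    # Fragments are joined without separators because words may be split
    # across several show operators.
    print("".join(extract_text(sample)))
```

Joining the fragments with no separator is a judgment call: it repairs words
that were split across operators, but it can also run separate words together
when the spacing was done by positioning rather than by space characters.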

For indexing you may get by just looking for anything in parentheses, but I
would look for a third-party utility if you want it done perfectly.

-Jeff Moss

----- Original Message ----- 
Sent: Friday, June 25, 2004 1:36 PM
Subject: [PHP-DB] Read a PDF file via PHP

> I'm working on a file upload system that accepts PDF files, reads the text
> in those files, and enters it into a database, which makes the text from
> the PDF indexable and searchable.
> I've got it all down except for the ability to read the text from a PDF.
> When the PHP read file function is used, the PDF file is read just fine,
> but when I return the results to my browser, they're of course nothing but
> jumble, because the code for the entire file was read.
> Is there any way to get PHP to simply read the PDF file for text
> only--just the surface of it, just the words, as if it were a human reading
> the PDF itself--and not for the internal code of the file?
> Thanks,
> Steve
> -- 
> PHP Database Mailing List (
> To unsubscribe, visit:
