Re: reading text out of ps/pdf

Herbert Voss Sat, 13 Jan 2001 11:23:48 -0800
Christopher Jones wrote:
> 
> I have that tool. But some pdf or ps files consist not of coded text but a
> bitmapped image. For instance, pdf and ps files which I download from journal
> databases are scanned images of journal pages. ps2ascii and pdftotext will not
> extract text from these files, since there is no ascii content to extract.
> 
> So my question is: is there any software out there which attempts to look at
> bitmaps and guess what the ascii would be-- something like those programs which
> read books through a scanner and try to match font characters to the image. And
> I say this question is a reach, because I know that those programs which I have
> heard about are either very expensive or very inaccurate.

With

pdfimages -f 1 file.pdf DirForTheImages

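you can extract all the images in the PDF file. The last argument is used
as the prefix for the image file names. With the -j option, images stored
in JPEG (DCT) format are saved as JPEG files; otherwise the default is PPM
or PBM format (a good choice).
With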

pdftotext file.pdf file.txt

you can convert the whole file to text.
When the PDF file contains scanned text that is stored as images,
you can convert those images from PBM to TIFF and then run an OCR
program on them.
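
The steps above could be chained roughly like this. It is only a sketch
under assumptions that are not in my description: pnmtotiff (from Netpbm)
does the PBM/PPM to TIFF conversion, Tesseract stands in for "an OCR
program", and the function names tiff_name and ocr_scanned_pdf are made up
for illustration. Substitute whatever converter and OCR tool you have.

```shell
#!/bin/sh
# Sketch: OCR the page images of a scanned PDF.
# Assumptions (not from the original post): pnmtotiff from Netpbm does the
# PBM/PPM -> TIFF step, and tesseract is the OCR program.

# Map an extracted image name to the TIFF name used for it,
# e.g. scan-000.pbm -> scan-000.tif (hypothetical helper).
tiff_name() {
    printf '%s\n' "${1%.*}.tif"
}

# ocr_scanned_pdf FILE.pdf IMAGE_ROOT (hypothetical helper)
# Extracts the images, converts each one to TIFF, and runs OCR on it.
ocr_scanned_pdf() {
    pdf=$1
    root=$2
    # pdfimages writes IMAGE_ROOT-000.pbm, IMAGE_ROOT-001.ppm, ...
    pdfimages -f 1 "$pdf" "$root" || return 1
    for img in "$root"-*.pbm "$root"-*.ppm; do
        [ -e "$img" ] || continue      # skip glob patterns that matched nothing
        tif=$(tiff_name "$img")
        pnmtotiff "$img" > "$tif" || return 1
        # tesseract appends .txt to the output base name it is given
        tesseract "$tif" "${tif%.tif}" || return 1
    done
}
```

After loading these definitions, "ocr_scanned_pdf file.pdf scan" should leave
one scan-NNN.txt per extracted image. How good the text is depends entirely
on the scan quality and on the OCR program.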


Herbert

-- 
[EMAIL PROTECTED]
http://perce.de/lyx/