Re: [CODE4LIB] Scanned PDF to text

Mads Villadsen Tue, 09 Dec 2014 05:42:38 -0800

On 2014-12-09 14:25, Kyle Banerjee wrote:

Howdy all,


I've just started a project that involves harvesting large numbers of
scanned PDF's and extracting information from the text from the OCR output.
The process I've started with -- use imagemagick to convert to tiff and
tesseract to pull out the OCR -- is more system intensive than I hoped it
would be.

I asked around the office and the process seems sensible overall. One suggestion was to use pdfimages instead of imagemagick as that should be faster.

However I would guess that most of the processing time is actually spent in tesseract so I don't know how much this suggestion will improve the overall performance.
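
One way to check that guess before changing anything is to time the two stages separately on a sample file. The following is only a rough sketch, not something we run here: it assumes the poppler version of pdfimages (which supports the -tiff flag) and tesseract are both on the PATH, and the function names and the "page" output prefix are made up for illustration.

```python
import subprocess
import time
from pathlib import Path


def pdfimages_cmd(pdf_path, out_prefix):
    # Assumption: poppler's pdfimages; -tiff writes the embedded images as TIFF files.
    return ["pdfimages", "-tiff", str(pdf_path), str(out_prefix)]


def tesseract_cmd(image_path, out_base):
    # tesseract adds the ".txt" extension to out_base on its own.
    return ["tesseract", str(image_path), str(out_base)]


def timed(func, *args, **kwargs):
    # Call func and return (its result, elapsed wall-clock seconds).
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def profile_pdf(pdf_path, work_dir):
    # Extract the page images, OCR each one, and report the time per stage.
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    prefix = work_dir / "page"  # hypothetical output prefix

    _, extract_seconds = timed(
        subprocess.run, pdfimages_cmd(pdf_path, prefix), check=True
    )

    ocr_seconds = 0.0
    # Match both .tif and .tiff in case the extension differs between versions.
    for image in sorted(work_dir.glob("page-*.tif*")):
        _, elapsed = timed(
            subprocess.run,
            tesseract_cmd(image, image.with_suffix("")),
            check=True,
        )
        ocr_seconds += elapsed

    return {"extract_seconds": extract_seconds, "ocr_seconds": ocr_seconds}
```

Calling profile_pdf("sample.pdf", "/tmp/ocr-test") on a few representative files should show whether the extraction step is a meaningful share of the total. If ocr_seconds dominates, swapping imagemagick for pdfimages will not help much, and running several tesseract processes in parallel is probably the more useful thing to look at.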


Regards.

--
Mads Villadsen <[email protected]>
Statsbiblioteket
It-udvikler
