On 2014-12-09 14:25, Kyle Banerjee wrote:
Howdy all,
I've just started a project that involves harvesting large numbers of
scanned PDFs and extracting information from the OCR text.
The process I've started with -- use ImageMagick to convert to TIFF and
Tesseract to pull out the OCR -- is more system-intensive than I hoped it
would be.
I asked around the office and the process seems sensible overall. One
suggestion was to use pdfimages instead of ImageMagick, as that should be
faster: it extracts the embedded page images directly instead of
rendering each page.
However, I would guess that most of the processing time is actually spent
in Tesseract, so I don't know how much this suggestion will improve the
overall performance.
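One way to check that guess would be to time the two stages separately
on a sample file. Below is a minimal sketch in Python, not something I
have run against your data. It assumes the pdfimages (Poppler) and
tesseract command-line tools are on the PATH; the helper names
pdfimages_cmd, tesseract_cmd and timed_run are made up for this example.

```python
import subprocess
import time


def pdfimages_cmd(pdf_path, prefix):
    # Assumption: Poppler's pdfimages is installed. With -tiff it writes
    # the images embedded in the PDF as numbered TIFF files named after
    # the given prefix.
    return ["pdfimages", "-tiff", pdf_path, prefix]


def tesseract_cmd(image_path, output_base):
    # Assumption: tesseract is installed. It writes the recognized text
    # to a text file named after output_base.
    return ["tesseract", image_path, output_base]


def timed_run(cmd):
    """Run cmd and return the wall-clock seconds it took."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start
```

Calling timed_run(pdfimages_cmd("sample.pdf", "page")) and then
timed_run(tesseract_cmd(...)) on each extracted image would give the
two numbers to compare (sample.pdf and page are placeholder names). If
the Tesseract total dominates, switching the extraction tool will not
change much.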
Regards.
--
Mads Villadsen <m...@statsbiblioteket.dk>
Statsbiblioteket
IT developer