Re: [CODE4LIB] Scanned PDF to text

2014-12-11 Thread Chris Fitzpatrick
Tesseract is going to be slow, and there might not be much you can do about
that.

You can do a couple of things, like set up processes that run on AWS EC2
spot instances, so you can put a standing bid order on AWS instances and
only run your OCR when the price drops.

Or you can buy ABBYY, which is much faster.

b,chris.



On Tue, Dec 9, 2014 at 5:45 PM, Kyle Banerjee kyle.baner...@gmail.com
wrote:

  I’m not quite sure if I understand the question, but if all you want to
 do is pull the text out of an OCR’ed PDF file, then I have found both Tika
 and PDFtotext to be useful tools.
 
  On the other hand, if you need to do the OCR itself, then employing
 Tesseract is probably the way to go.

 For clarity, I have to do the OCR itself. I've been using CAM::PDF to
 extract existing text.

 Kyle



Re: [CODE4LIB] Scanned PDF to text

2014-12-11 Thread David J. Fiander
Art Rhyno talked about doing this with scans of old community newspapers
a few years ago (https://www.youtube.com/watch?v=gcjCiS9pJ3A).

Yes, it's very compute-intensive and slow. He set up Hadoop to farm jobs
out to the PCs in the library's public lab while the library was closed
at night.
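
A much smaller version of the same idea, for a single machine, is to keep every core busy with its own Tesseract process. The following is only a sketch, not Art's Hadoop setup: it assumes the `tesseract` command is on the PATH and that each page already exists as an image file, and the function names are made up for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def tesseract_command(image: Path) -> list[str]:
    """Build the Tesseract call; it writes <image stem>.txt next to the image."""
    return ["tesseract", str(image), str(image.with_suffix(""))]


def ocr_page(image: Path) -> Path:
    """OCR one page image and return the path of the text file produced."""
    subprocess.run(tesseract_command(image), check=True, capture_output=True)
    return image.with_suffix(".txt")


def ocr_all(images: list[Path], workers: int = 4) -> list[Path]:
    """Run several Tesseract processes at once.

    Threads are enough here because the heavy work happens in the
    child processes, not in Python itself.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_page, images))
```

Something like `ocr_all(sorted(Path("scans").glob("*.tif")), workers=8)` would then process a directory of page images; the directory name and the worker count are placeholders to tune per machine.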

- David

On 2014/12/11 03:59, Chris Fitzpatrick wrote:
 Tesseract is going to be slow, and there might not be much you can do about
 that.
 
 You can do a couple of things, like set up processes that run on AWS EC2
 spot instances, so you can put a standing bid order on AWS instances and
 only run your OCR when the price drops.
 
 Or you can buy ABBYY, which is much faster.
 
 b,chris.
 
 
 On Tue, Dec 9, 2014 at 5:45 PM, Kyle Banerjee kyle.baner...@gmail.com
 wrote:
 
 I’m not quite sure if I understand the question, but if all you want to
 do is pull the text out of an OCR’ed PDF file, then I have found both Tika
 and PDFtotext to be useful tools.

 On the other hand, if you need to do the OCR itself, then employing
 Tesseract is probably the way to go.

 For clarity, I have to do the OCR itself. I've been using CAM::PDF to
 extract existing text.

 Kyle



Re: [CODE4LIB] Scanned PDF to text

2014-12-09 Thread Mads Villadsen

On 2014-12-09 14:25, Kyle Banerjee wrote:

Howdy all,

I've just started a project that involves harvesting large numbers of
scanned PDF's and extracting information from the text from the OCR output.
The process I've started with -- use imagemagick to convert to tiff and
tesseract to pull out the OCR -- is more system intensive than I hoped it
would be.



I asked around the office and the process seems sensible overall. One 
suggestion was to use pdfimages instead of imagemagick as that should be 
faster.


However, I would guess that most of the processing time is actually spent
in tesseract, so I don't know how much this suggestion will improve the
overall performance.
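
One way to settle that guess is to time the two stages separately on a sample PDF. This is a rough sketch under a few assumptions: the Poppler build of `pdfimages` (which accepts `-tiff` and names its output `<prefix>-NNN.tif`) and `tesseract` are both on the PATH, and the helper names are invented for the example.

```python
import subprocess
import time
from pathlib import Path


def pdfimages_command(pdf: Path, prefix: Path) -> list[str]:
    """Extract the embedded page images as TIFF without re-rendering them."""
    return ["pdfimages", "-tiff", str(pdf), str(prefix)]


def timed(cmd: list[str]) -> float:
    """Run a command and return the wall-clock seconds it took."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True)
    return time.perf_counter() - start


def profile(pdf: Path, workdir: Path) -> dict[str, float]:
    """Time image extraction and OCR separately for one PDF."""
    extract = timed(pdfimages_command(pdf, workdir / "page"))
    ocr = 0.0
    for image in sorted(workdir.glob("page-*.tif")):
        ocr += timed(["tesseract", str(image), str(image.with_suffix(""))])
    return {"extract": extract, "ocr": ocr}
```

If the `ocr` figure dwarfs `extract`, swapping imagemagick for pdfimages will not change the total much, and the effort is better spent on running more Tesseract processes at once.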


Regards.

--
Mads Villadsen m...@statsbiblioteket.dk
Statsbiblioteket
It-udvikler


Re: [CODE4LIB] Scanned PDF to text

2014-12-09 Thread Eric Lease Morgan
On Dec 9, 2014, at 8:25 AM, Kyle Banerjee kyle.baner...@gmail.com wrote:

 I've just started a project that involves harvesting large numbers of
 scanned PDF's and extracting information from the text from the OCR output.
 The process I've started with -- use imagemagick to convert to tiff and
 tesseract to pull out the OCR -- is more system intensive than I hoped it
 would be.

I’m not quite sure if I understand the question, but if all you want to do is 
pull the text out of an OCR’ed PDF file, then I have found both Tika and 
PDFtotext to be useful tools. [1, 2] Here’s a Perl script that takes a PDF as
input and uses Tika to output the OCR’ed text:

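  #!/usr/bin/perl

  # require
  use strict;
  use warnings;

  # configure; assumes tika.jar sits in the current directory
  use constant TIKA => ( 'java', '-jar', 'tika.jar', '-T' );

  # initialize; insist on a file name
  my $file = shift or die "Usage: $0 <file.pdf>\n";

  # do the work; Tika writes the text to STDOUT itself, and the list
  # form of system() bypasses the shell, so odd file names are safe
  my @cmd = ( TIKA, $file );
  system( @cmd ) == 0 or die "Tika exited with status $?\n";

  # done
  exit;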

Tika can also run in server mode, which makes it more efficient for extracting
the text from multiple files because the JVM starts only once.
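
For illustration, talking to a running Tika server could look roughly like the sketch below. It assumes tika-server has been started separately and is listening at its usual default of `http://localhost:9998`; the function names are invented for the example, and error handling is left out.

```python
import urllib.request
from pathlib import Path

# Assumed address of a locally running tika-server; adjust as needed.
TIKA_URL = "http://localhost:9998/tika"


def tika_request(pdf_bytes: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build the PUT request that asks Tika for a plain-text rendering."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="PUT",
        headers={"Accept": "text/plain", "Content-Type": "application/pdf"},
    )


def extract_text(pdf: Path, url: str = TIKA_URL) -> str:
    """Send one PDF to the running server and return the extracted text."""
    with urllib.request.urlopen(tika_request(pdf.read_bytes(), url)) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text(Path("report.pdf"))` in a loop over many files (the file name is a placeholder) reuses the same server process for every file.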

On the other hand, if you need to do the OCR itself, then employing Tesseract 
is probably the way to go. 
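
If a harvest mixes both kinds of PDF, it may be worth checking for an existing text layer first and sending only the rest to Tesseract. A minimal sketch, assuming `pdftotext` is on the PATH; the 20-character threshold and the function names are arbitrary choices for the example, not anything the tools prescribe.

```python
import subprocess
from pathlib import Path


def has_text_layer(text: str, min_chars: int = 20) -> bool:
    """Treat a PDF as already OCR'ed if extraction yields enough real characters."""
    return len("".join(text.split())) >= min_chars


def existing_text(pdf: Path) -> str:
    """Ask pdftotext to write the text layer to stdout ('-' as the output file)."""
    result = subprocess.run(
        ["pdftotext", str(pdf), "-"],
        check=True,
        capture_output=True,
        encoding="utf-8",
        errors="replace",
    )
    return result.stdout


def needs_ocr(pdf: Path) -> bool:
    """True when the PDF has no usable text layer and must go to Tesseract."""
    return not has_text_layer(existing_text(pdf))
```

Whitespace and form feeds are ignored when counting, since pdftotext emits page breaks even for image-only pages.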

[1] Tika - http://tika.apache.org
[2] PDFtoText - http://www.foolabs.com/xpdf/download.html

—
ELM