Hi All,

I have been trying to do OCR within R (reading PDF files whose content is
stored as scanned images). I have been reading about this at
http://electricarchaeology.ca/2014/07/15/doing-ocr-within-r/

This is a very good post.

Effectively there are 3 steps:

1. convert pdf to ppm (an image format)
2. convert ppm to tif ready for tesseract (using ImageMagick for convert)
3. convert tif to text file

The effective code for the above 3 steps, as per the linked post:

lapply(myfiles, function(i){
  # convert pdf to ppm (an image format), just pages 1-10 of the PDF
  # but you can change that easily, just remove or edit the
  # -f 1 -l 10 bit in the line below
  shell(shQuote(paste0("F:/xpdf/bin64/pdftoppm.exe ", i,
                       " -f 1 -l 10 -r 600 ocrbook")))
  # convert ppm to tif ready for tesseract
  shell(shQuote(paste0("F:/ImageMagick-6.9.1-Q16/convert.exe *.ppm ", i,
                       ".tif")))
  # convert tif to text file
  shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i,
                       " -l eng")))
  # delete tif file
  file.remove(paste0(i, ".tif"))
})

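
For comparison, here is a minimal sketch of how the same three commands could be built with each argument quoted separately, so that they can be printed and inspected before anything is run. The helper name build_ocr_commands and the file name "my scan.pdf" are hypothetical; the executable paths and the "ocrbook" prefix are the ones from the code above. The options are placed before the file name, following pdftoppm's usage line, and the sketch assumes that ImageMagick expands the *.ppm wildcard itself.

```r
# Hypothetical helper: builds (but does not run) the three commands for one PDF.
# Executable paths and the "ocrbook" prefix are taken from the code above.
build_ocr_commands <- function(pdf, quote_type = "cmd") {
  # Quote one argument at a time, so paths with spaces stay single arguments
  q <- function(x) shQuote(x, type = quote_type)
  tif <- paste0(pdf, ".tif")
  list(
    pdf_to_ppm = list(
      program = "F:/xpdf/bin64/pdftoppm.exe",
      args = c("-f", "1", "-l", "10", "-r", "600", q(pdf), "ocrbook")
    ),
    ppm_to_tif = list(
      program = "F:/ImageMagick-6.9.1-Q16/convert.exe",
      # Assumption: ImageMagick expands the wildcard itself
      args = c("*.ppm", q(tif))
    ),
    tif_to_text = list(
      program = "F:/Tesseract-OCR/tesseract.exe",
      args = c(q(tif), q(pdf), "-l", "eng")
    )
  )
}

cmds <- build_ocr_commands("my scan.pdf")
# Print each command line for inspection before running anything
for (cmd in cmds) cat(cmd$program, cmd$args, "\n")
```

Each entry could then be run with system2(cmd$program, cmd$args, stdout = TRUE, stderr = TRUE), which returns the program's output as a character vector, so any message from tesseract would be visible in R.
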
The first two steps work fine (although they take a good amount of time for
a 4-page PDF; I will look into scalability later, once I know whether this
works at all).

While running the 3rd step, i.e.

shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))

I get this error:

Error: evaluation nested too deeply: infinite recursion / options(expressions=)?

Or Tesseract crashes.
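
To help narrow this down, here is a small check of the string that shQuote() produces for the 3rd step. It is only a sketch: "book.pdf" is a hypothetical file name, and type = "cmd" is passed explicitly because that is the quoting style shQuote() uses by default on Windows. Whether this has anything to do with the error is not established.

```r
i <- "book.pdf"  # hypothetical file name, for illustration only
cmd <- paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")

# shQuote() is applied to the whole command line, not to single arguments
quoted <- shQuote(cmd, type = "cmd")

cat(cmd, "\n")     # the command as built by paste0()
cat(quoted, "\n")  # the same command wrapped in one pair of double quotes
```

The second line printed is the complete command enclosed in a single pair of double quotes, which is the string that shell() receives.
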

Any workaround or root cause analysis would be appreciated.

Regards,
Anshuk Pal Chaudhuri



______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
