Hi All,

I have been trying to do OCR within R (reading PDF data that is stored as a scanned image). I have been reading about this at http://electricarchaeology.ca/2014/07/15/doing-ocr-within-r/
This is a very good post. It effectively has 3 steps:

1. convert pdf to ppm (an image format)
2. convert ppm to tif ready for tesseract (using ImageMagick for convert)
3. convert tif to text file

The effective code for the above 3 steps, as per the linked post:

    lapply(myfiles, function(i){
      # convert pdf to ppm (an image format), just pages 1-10 of the PDF
      # but you can change that easily, just remove or edit the
      # -f 1 -l 10 bit in the line below
      shell(shQuote(paste0("F:/xpdf/bin64/pdftoppm.exe ", i, " -f 1 -l 10 -r 600 ocrbook")))
      # convert ppm to tif ready for tesseract
      shell(shQuote(paste0("F:/ImageMagick-6.9.1-Q16/convert.exe *.ppm ", i, ".tif")))
      # convert tif to text file
      shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))
      # delete tif file
      file.remove(paste0(i, ".tif"))
    })

The first two steps work fine (although they take a good amount of time for 4 pages of a PDF; I will look into scalability later, once I know whether this works at all).

While running the 3rd step, i.e.

    shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))

I get this error:

    Error: evaluation nested too deeply: infinite recursion / options(expressions=)?

or Tesseract crashes.

Any workaround or root cause analysis would be appreciated.

Regards,
Anshuk Pal Chaudhuri
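To help tell an R-side error apart from a Tesseract crash, the third step could be run on its own with its output captured. The following is only a minimal sketch, not a confirmed fix: `tesseract_args()` and `run_step()` are hypothetical helper names, and using `system2()` instead of `shell(shQuote(...))` is an assumption about what might be easier to debug, not something taken from the linked post.

```r
# Hypothetical helper: build the argument vector for the Tesseract step
# without running anything, so it can be printed and inspected first.
# Follows the naming used above: input is "<pdf>.tif", output base is "<pdf>".
tesseract_args <- function(pdf, lang = "eng") {
  tif <- paste0(pdf, ".tif")
  c(shQuote(tif), shQuote(pdf), "-l", lang)
}

# Hypothetical helper: run one external program, capture stdout and stderr,
# and stop with the program's own messages if the exit status is non-zero.
run_step <- function(exe, args) {
  out <- suppressWarnings(system2(exe, args, stdout = TRUE, stderr = TRUE))
  status <- attr(out, "status")  # only set when the exit status is non-zero
  if (!is.null(status) && status != 0) {
    stop(sprintf("%s exited with status %s:\n%s",
                 basename(exe), status, paste(out, collapse = "\n")))
  }
  invisible(out)
}

# Example call for a single file (path as in the code above):
# run_step("F:/Tesseract-OCR/tesseract.exe", tesseract_args(myfiles[1]))
```

Running this for one file outside `lapply()` should show whether Tesseract itself reports a problem with the tif, or whether the error is raised by R before Tesseract is ever started.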