On 2020-05-13 06:44 -0700, Jeff Newmiller wrote:
> On May 13, 2020 6:33:03 AM PDT, Manish Mukherjee wrote:
> >
> > How to extract this value from a number
> > of PDF files and put it in a data frame.
>
> they could be part of embedded bitmaps.

Dear Manish and Jeff,

I recently found the programs pdftoppm [1] and Google
tesseract [2] to be really useful when reading text from
pdfs formatted as "a single column of text of variable
sizes", e.g. a receipt from a grocery store :)

    folder <- "path/to/pdfs"
    # escape the dot so only names ending in ".pdf" match
    pdfs <- list.files(folder, "\\.pdf$")
    pdf <- pdfs[1]
    # render the pdf to png at 500 dpi, then OCR page 1;
    # -l nor: Norwegian, --psm 4: single column of text
    cmd <- paste0("pdftoppm -png -r 500 ",
                  shQuote(file.path(folder, pdf)),
                  " /tmp/out && ",
                  "tesseract /tmp/out-1.png - ",
                  "-l nor --psm 4")
    lines <- system(cmd, intern=TRUE)
    # with x holding one such command per pdf:
    # x <- lapply(x, system, intern=TRUE)
    # names(x) <- pdfs
    # saveRDS(x, "texts.rds")

In any other case with a sensibly formatted pdf, I would
have used pdftotext [3] ...

Best,
Rasmus

[1] https://manpages.debian.org/buster/poppler-utils/pdftoppm.1.en.html
[2] https://manpages.debian.org/buster/tesseract-ocr/tesseract.1.en.html
[3] https://manpages.debian.org/buster/poppler-utils/pdftotext.1.en.html
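
For the "put it in a data frame" part of the question, here is a rough sketch of what could come after the OCR step. It assumes the text ends up as a named list with one character vector of lines per pdf (the shape of the list saved to "texts.rds"). The sample lines, the "^Total" label, and the helper name extract_value are all made up for illustration, since the actual value to extract was not shown; the regular expression would need adapting to the real documents.

```r
# Hypothetical OCR output: one character vector of lines per pdf.
texts <- list(
  "a.pdf" = c("Melk 1l   19,90", "Total   119,90"),
  "b.pdf" = c("Ost   34,50", "Total   34,50"),
  "c.pdf" = c("no matching line here")
)

# Return the first number on the first line matching `label`,
# or NA if no such line or number is found.
# A decimal comma is converted to a decimal point.
extract_value <- function(lines, label = "^Total") {
  hit <- grep(label, lines, value = TRUE)
  if (length(hit) == 0) return(NA_real_)
  num <- regmatches(hit[1], regexpr("[0-9]+([.,][0-9]+)?", hit[1]))
  if (length(num) == 0) return(NA_real_)
  as.numeric(sub(",", ".", num, fixed = TRUE))
}

# One row per pdf: file name and extracted value.
df <- data.frame(
  file = names(texts),
  value = vapply(texts, extract_value, numeric(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)
print(df)
```

Files where nothing matches get NA rather than being dropped, so it stays easy to see which pdfs need a closer look.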