Re: [R] Extracting the first currency value from PDF files

Rasmus Liland Wed, 13 May 2020 07:27:23 -0700

On 2020-05-13 06:44 -0700, Jeff Newmiller wrote:
> On May 13, 2020 6:33:03 AM PDT, Manish Mukherjee wrote:
> > 
> > How to extract this value from a number 
> > of PDF files and put it in a data frame. 
> 
> they could be part of embedded bitmaps.


Dear Manish and Jeff,

I recently found the programs pdftoppm [1] 
and Google's Tesseract [2] really useful 
for reading text from PDFs laid out as "a 
single column of text of variable sizes", 
e.g. a receipt from a grocery store :)

folder <- "path/to/pdfs"
pdfs <- list.files(folder, "\\.pdf$")
# One shell command per PDF: render it to PNG
# at 500 dpi, then OCR the first page image
# to stdout ("nor" = Norwegian language data).
cmds <-
  paste0("pdftoppm -png -r 500 ",
         shQuote(file.path(folder, pdfs)),
         " /tmp/out && ",
         "tesseract /tmp/out-1.png - ",
         "-l nor --psm 4")
lines <- system(cmds[1], intern=TRUE)
# x <- lapply(cmds, system, intern=TRUE)
# names(x) <- pdfs
# saveRDS(x, "texts.rds")

In any other case with a sensibly formatted 
pdf, I would have used pdftotext [3] ...
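
For a PDF with a real text layer, the pdftotext 
route might look roughly like the sketch below. 
It is only an illustration under assumptions: the 
helper names (first_currency, pdf_to_lines, 
extract_all) are made up here, the regular 
expression assumes amounts written like 12.50 or 
1,234.56, and pdftotext is assumed to be on the 
PATH.

```r
# Return the first currency-like value found in a character vector
# of text lines, or NA if there is none. The pattern is an assumption:
# digits, optional ",ddd" thousands groups, and exactly two decimals.
first_currency <- function(lines) {
  pattern <- "[0-9]+(,[0-9]{3})*\\.[0-9]{2}"
  hits <- regmatches(lines, regexpr(pattern, lines))
  if (length(hits) == 0) return(NA_real_)
  as.numeric(gsub(",", "", hits[1]))
}

# Run pdftotext on one file and capture its output as lines.
# "-layout" keeps the physical layout; "-" writes to stdout.
pdf_to_lines <- function(path) {
  system2("pdftotext", c("-layout", shQuote(path), "-"),
          stdout = TRUE)
}

# Build a data frame with one row per PDF in `folder`.
extract_all <- function(folder) {
  pdfs <- list.files(folder, pattern = "\\.pdf$",
                     full.names = TRUE)
  values <- vapply(
    pdfs,
    function(p) first_currency(pdf_to_lines(p)),
    numeric(1))
  data.frame(file = basename(pdfs), value = unname(values))
}

# The extraction step can be tried without any PDF at hand:
demo <- c("Total due", "Amount: $1,234.56", "Tax 12.50")
first_currency(demo)
```

Calling extract_all("path/to/pdfs") would then give 
a data frame with columns file and value. The same 
first_currency() helper could also be applied to 
the lines returned by the tesseract command above.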

Best,
Rasmus

[1] https://manpages.debian.org/buster/poppler-utils/pdftoppm.1.en.html
[2] https://manpages.debian.org/buster/tesseract-ocr/tesseract.1.en.html
[3] https://manpages.debian.org/buster/poppler-utils/pdftotext.1.en.html

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
