In one message or another, Mark Ehle said something like this:

> I am using pdftotext to extract text from PDF files in a digital newspaper
> archive I am creating for a local public library. So far, it's working great.
> But - I am using up a fair amount of disk space and would like to figure out a
> way to create an OCR'd PDF from an image and the bounding box data. That way
> I would not have to store the PDF files as well as the images. Is there a way
> to do that?

Seems like you would want to store the PDF instead of the images. Anyway,
you should look at Tesseract:

https://code.google.com/p/tesseract-ocr/

I haven't used it myself, but my understanding is that it'll embed the OCR'd
data into the PDF itself, allowing searching, text selection, etc. from a PDF
viewer.

-e

--
Ed Porras
[email protected]
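
For illustration only, here is a minimal sketch of driving Tesseract from Python to turn a page image into a searchable PDF. It assumes the `tesseract` command-line tool is installed and on the PATH, and that it is a version that accepts the `pdf` output config. The function names and file names are hypothetical. Note that this runs Tesseract's own OCR on the image; it does not reuse bounding box data already extracted with pdftotext.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_pdf_command(image_path, output_base):
    """Build the argument list for: tesseract <image> <output_base> pdf

    Tesseract appends the ".pdf" extension to output_base itself.
    """
    return ["tesseract", str(image_path), str(output_base), "pdf"]


def make_searchable_pdf(image_path, output_base):
    """Run Tesseract on one image and return the path of the PDF it writes.

    Assumes a Tesseract build that supports the "pdf" output config.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(tesseract_pdf_command(image_path, output_base), check=True)
    return Path(str(output_base) + ".pdf")
```

Something like `make_searchable_pdf("page-001.png", "page-001")` would then be expected to leave `page-001.pdf` next to the image, with the recognized text layered under the page so that a PDF viewer can search and select it.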

_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
