I'm using libpoppler-cpp-dev 0.86.1 on Ubuntu to read PDF files, and it works well. I call doc->create_page(idx) to get the page, then page->text_list() to get all the text boxes. The PDFs I see either contain text or, if they came from a scan, contain an image with no text, in which case I fall back to other techniques to read what I need.
But...! Some fax machines and business scanners try to do OCR and embed the text results into the PDF. The quality of the OCR is poor, but when I attempt to extract the text, I do get back text boxes as if the PDF contained real text, which leads me down the wrong path.

Is there anything in the way the text was added to the PDF that I can use as a hint that it was added after the fact, and not as part of the original PDF creation process? Something I can use to determine whether the text can be trusted? I'm reading up on things like xref tables to understand the internals of PDF files, so I can try to find a pattern that separates my "good" PDF files from my "problematic" ones. I wondered if there is a way to see whether the text is part of the page itself or was tacked on afterwards.

Thanks,
Stéphane

--
Stéphane Charette
about.me/stephane.charette
