Indeed, the problematic PDF files do use render mode 3. At first I thought I might use the number of fonts a PDF uses to determine which ones had this hidden OCR, but some documents have quite a large number of fonts in them considering the whole thing is images and hidden text.
I don't see a way with the Poppler C++ API to determine if text is using render mode 3. The only thing provided is the text box rectangle and the text itself.

At the moment, I've uncompressed the PDF using "podofouncompress" and in the results I see stuff like this:

    stream
    BT
    3 Tr
    0.00 Tc

From what I can tell, the Poppler tools and API don't offer any public means to uncompress a PDF file. Looking into how that works, hoping there is a way to do it programmatically without having to use system() calls to a 3rd party tool.

Thanks for the hint about render mode 3.

Stéphane

On Fri, Oct 14, 2022 at 2:00 PM Leonard Rosenthol <[email protected]> wrote:

> There are many different ways to add OCR’d text to a PDF, though one of
> the most common is use of “hidden text”, where the text is drawn using Text
> Render Mode 3. I don’t recall if Poppler exposes this information in the
> public APIs, but it certainly has it in the graphic state internally.
>
> Leonard
>
> *From: *poppler <[email protected]> on behalf of
> Stéphane Charette <[email protected]>
> *Date: *Friday, October 14, 2022 at 2:54 PM
> *To: *[email protected] <[email protected]>
> *Subject: *[poppler] getting the text from PDF files
>
> Using libpoppler-cpp-dev 0.86.1 on Ubuntu to read PDF files. Works well.
>
> doc->create_page(idx) to get the page, then page->text_list() to get all
> the boxes. PDFs seem to either have text, or if it was a scan then I have
> an image with no text, and I fall back to other techniques to read what I
> need.
>
> But...! Some fax machines and business scanners try to do OCR, and embeds
> the text results into the PDF. The quality of the OCR is poor, but when I
> attempt to extract the text, I do get back the expected text boxes which
> leads me down the wrong path.
>
> Is there anything in the way the text was added to the PDF that I can use
> as a hint that the text was added to the PDF after-the-fact, and not as
> part of the original PDF creation process? Something I can use to
> determine if the text can be trusted? Reading up on things like Xref
> tables to get an understanding of the internals of PDF files so I can
> attempt to find a pattern between my "good" and "problematic" PDF files.
> Wondered if there was a way to see if the text is part of the page itself,
> or if it was tacked on afterwards.
>
> Thanks,
>
> Stéphane
--
Stéphane Charette
about.me/stephane.charette
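P.S. For anyone finding this thread later, here is a rough sketch of the kind of check I have in mind once a page's content stream has been uncompressed: look for the "3 Tr" operator pair seen in the podofouncompress output. The function name uses_invisible_text() is my own invention and not part of Poppler or PoDoFo. The sketch assumes the stream has already been decompressed into a std::string, and it only skips literal "( ... )" strings, so hex strings, comments and inline image data could still confuse it.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Poppler or PoDoFo): returns true if an
// already uncompressed PDF content stream sets text render mode 3
// (invisible text), i.e. contains the operand "3" directly followed by the
// operator "Tr". Literal strings "( ... )" are skipped so that text such as
// "(3 Tr)" passed to Tj is not mistaken for an operator.
bool uses_invisible_text(const std::string& content)
{
    std::string prev;                  // token seen just before the current one
    std::size_t i = 0;
    const std::size_t n = content.size();

    while (i < n)
    {
        const unsigned char c = static_cast<unsigned char>(content[i]);

        if (std::isspace(c))
        {
            ++i;
            continue;
        }

        if (c == '(')
        {
            // Skip a literal string, honouring backslash escapes and nesting.
            int depth = 0;
            while (i < n)
            {
                if (content[i] == '\\') { i += 2; continue; }
                if (content[i] == '(') ++depth;
                else if (content[i] == ')') --depth;
                ++i;
                if (depth == 0) break;
            }
            prev.clear();
            continue;
        }

        // Read one token up to the next whitespace or start of a string.
        const std::size_t start = i;
        while (i < n
               && !std::isspace(static_cast<unsigned char>(content[i]))
               && content[i] != '(')
        {
            ++i;
        }
        const std::string token = content.substr(start, i - start);

        if (token == "Tr" && prev == "3")
        {
            return true;
        }
        prev = token;
    }
    return false;
}
```

Calling uses_invisible_text() on each uncompressed page stream would give a per-page hint that the text layer is hidden OCR. It says nothing about how good that OCR is, and a document could in principle use render mode 3 for other reasons, so I would treat it as a heuristic only.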
