Thanks again for the information.

> CMaps and CID Fonts predate PDF and were introduced first in Postscript as
> described in Adobe Technote 5014,

The PDF that is giving me problems has CID Type 0C fonts with the Identity-H encoding. When I edit the PDF, I can find objects like the one below at the end. It looks like pdftops isn't passing them to the PostScript.

> I can tell you that if I export a PDF using CIDFonts from Adobe Acrobat to
> Postscript and run that Postscript through Acrobat Distiller – I get a fully
> searchable PDF.

I just have Linux, and I don't think I have a way to run Acrobat. Would it be possible to take the PDF that I posted to https://bugs.ghostscript.com/show_bug.cgi?id=702526 and add the PS generated by Acrobat and the PDF generated from Distiller?

I looked at the Adobe document that you linked and a few others that I already had, and they seemed to be about external CMap files. I would like to see an example of a ToUnicode CMap embedded in a PostScript file. I am hoping that a working PostScript file, combined with the documentation that you linked and what I can see by editing the PDF, will be enough to find a way to get pdftops to generate it.

Regards, William

Below is a section of the original PDF. I think that CMapType 2 is the ToUnicode map. poppler understands it, or else pdftotext wouldn't work. I am hoping that it is something that poppler's PSOutputDev::setupEmbeddedCIDType0Font() can generate.

https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/5411.ToUnicode.pdf

281 0 obj
<</Filter/FlateDecode/Length 322>>stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapType 2 def
/CMapName/R281 def
1 begincodespacerange
<0000><ffff>
endcodespacerange
30 beginbfrange
<0001><0001><0043>
<0002><0002><0048>
...
<001f><001f><007a>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end end
endstream
endobj

212 0 obj
<</BaseFont/MPJWBI+HelveticaNeueLTStd-BdIt/ToUnicode 281 0 R/Type/Font
/Encoding /Identity-H/DescendantFonts[213 0 R]/Subtype/Type0>>
endobj

________________________________
From: Leonard Rosenthol <[email protected]>
Sent: Wednesday, July 1, 2020 2:48 PM
To: William Bader <[email protected]>; [email protected] <[email protected]>
Subject: Re: [poppler] pdftops font subset question

> Those Unicode CMaps can't be passed in postscript, so do I permanently lose
> useful text extraction when I convert this PDF to postscript with pdftops?
>

Of course they can! CMaps and CID Fonts predate PDF and were introduced first in Postscript as described in Adobe Technote 5014, https://www.adobe.com/content/dam/acom/en/devnet/font/pdfs/5014.CIDFont_Spec.pdf

I can tell you that if I export a PDF using CIDFonts from Adobe Acrobat to Postscript and run that Postscript through Acrobat Distiller – I get a fully searchable PDF.

Now… whether pdftops will output them – I don’t know. And whether Ghostscript, upon encountering them, will correctly restore the font encoding. Again, I don’t know.

Leonard
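
As a side note on the ToUnicode CMap in object 281: the bfrange entries can be checked by hand with a short script. The following is only a minimal Python sketch, not anything taken from poppler. The function name parse_bfranges and the SAMPLE_CMAP string are made up for illustration, the sample contains only the three entries visible in the excerpt (the elided ones are left out), and the sketch assumes the simple <lo> <hi> <dst> form with UTF-16BE destinations, not the bracketed array form.

```python
import re

# Hypothetical sample: only the three bfrange entries visible in the
# excerpt; the entries elided with "..." are not reproduced here.
SAMPLE_CMAP = """
1 begincodespacerange
<0000><ffff>
endcodespacerange
3 beginbfrange
<0001><0001><0043>
<0002><0002><0048>
<001f><001f><007a>
endbfrange
"""

HEX_TRIPLE = re.compile(
    r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>"
)


def parse_bfranges(cmap_text):
    """Return {source code: Unicode string} for the bfrange sections.

    Assumes every entry has the form <lo> <hi> <dst>, where dst is a
    UTF-16BE string whose length is a multiple of four hex digits.
    The array form <lo> <hi> [<a> <b> ...] is not handled.
    """
    mapping = {}
    # Only look between beginbfrange and endbfrange, so that the
    # codespacerange entries are not mistaken for mappings.
    for section in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in HEX_TRIPLE.findall(section):
            start, end = int(lo, 16), int(hi, 16)
            # Split the destination into 16-bit code units.
            units = [int(dst[i:i + 4], 16) for i in range(0, len(dst), 4)]
            for offset in range(end - start + 1):
                # Consecutive source codes increment the last code unit.
                current = units[:-1] + [units[-1] + offset]
                raw = b"".join(u.to_bytes(2, "big") for u in current)
                mapping[start + offset] = raw.decode("utf-16-be")
    return mapping


if __name__ == "__main__":
    for code, text in sorted(parse_bfranges(SAMPLE_CMAP).items()):
        print(f"{code:04x} -> {text!r}")
```

Running something like this over the decompressed stream gives a quick way to confirm which character codes the CMap maps to which Unicode characters, which may help when comparing the original PDF against whatever a PostScript round trip produces.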
_______________________________________________
poppler mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/poppler
