Thanks again for the information.

>CMaps and CID Fonts predate PDF and were introduced first in Postscript as 
>described in Adobe Technote 5014,

The PDF that is giving me problems uses CID Type 0C fonts with the Identity-H 
encoding.
When I open the PDF in an editor, I can find objects like the ones at the end 
of this message.
It looks like pdftops isn't passing them through to the PostScript output.

>I can tell you that if I export a PDF using CIDFonts from Adobe Acrobat to 
>Postscript and run that Postscript through Acrobat Distiller – I get a fully 
>searchable PDF.

I only have Linux, so I don't think I have a way to run Acrobat. Would it 
be possible to take the PDF that I posted to 
https://bugs.ghostscript.com/show_bug.cgi?id=702526 and attach the PS generated 
by Acrobat and the PDF generated by Distiller?
I looked at the Adobe document that you linked and a few others that I already 
had, and they seemed to be about external cmap files.
I would like to see an example of a ToUnicode CMap embedded in a PostScript 
file.
I am hoping that seeing a working postscript file combined with the 
documentation that you linked and what I can see by editing the PDF should be 
enough to find a way to get pdftops to generate it.

Regards, William

Below is a section of the original PDF. I think that the CMapType 2 stream is 
the ToUnicode map. poppler must understand it, or else pdftotext wouldn't work.
I am hoping that it is something that poppler's 
PSOutputDev::setupEmbeddedCIDType0Font() can generate. 
https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/5411.ToUnicode.pdf

281 0 obj
<</Filter/FlateDecode/Length 322>>stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapType 2 def
/CMapName/R281 def
1 begincodespacerange
<0000><ffff>
endcodespacerange
30 beginbfrange
<0001><0001><0043>
<0002><0002><0048>
...
<001f><001f><007a>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end end

endstream
endobj
212 0 obj
<</BaseFont/MPJWBI+HelveticaNeueLTStd-BdIt/ToUnicode 281 0 R/Type/Font
/Encoding /Identity-H/DescendantFonts[213 0 R]/Subtype/Type0>>
endobj
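
To illustrate how the bfrange entries in object 281 above map character codes 
to Unicode, here is a minimal Python sketch. It is not poppler code: 
parse_bfranges and SAMPLE are hypothetical names made up for this example. It 
assumes the stream has already been inflated (FlateDecode), handles only 
destinations that are a single UTF-16 code unit (four hex digits or fewer), and 
ignores bfchar sections and the array form of bfrange.

```python
import re

# Three of the entries shown in object 281; the elided ones are not guessed at.
SAMPLE = """
30 beginbfrange
<0001><0001><0043>
<0002><0002><0048>
<001f><001f><007a>
endbfrange
"""

TRIPLE = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfranges(cmap_text):
    """Return {character code: Unicode string} for simple bfrange entries."""
    mapping = {}
    # Each section has the form: N beginbfrange <lo><hi><dst> ... endbfrange
    for section in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in TRIPLE.findall(section):
            if len(dst) > 4:
                # Multi-unit UTF-16 destinations are out of scope for this sketch.
                continue
            lo_code, hi_code, base = int(lo, 16), int(hi, 16), int(dst, 16)
            # Codes lo..hi map to consecutive code points starting at dst.
            for offset in range(hi_code - lo_code + 1):
                mapping[lo_code + offset] = chr(base + offset)
    return mapping


if __name__ == "__main__":
    for code, text in sorted(parse_bfranges(SAMPLE).items()):
        print(f"{code:04x} -> {text}")
```

With the three entries above, codes 0001, 0002 and 001f map to "C", "H" and 
"z". This is the information that would have to survive the conversion to 
PostScript for text extraction to keep working.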


________________________________
From: Leonard Rosenthol <[email protected]>
Sent: Wednesday, July 1, 2020 2:48 PM
To: William Bader <[email protected]>; [email protected] 
<[email protected]>
Subject: Re: [poppler] pdftops font subset question


> Those Unicode CMaps can't be passed in postscript, so do I permanently lose 
> useful text extraction when I convert this PDF to postscript with pdftops?

Of course they can! CMaps and CID Fonts predate PDF and were introduced first 
in Postscript as described in Adobe Technote 5014, 
https://www.adobe.com/content/dam/acom/en/devnet/font/pdfs/5014.CIDFont_Spec.pdf



I can tell you that if I export a PDF using CIDFonts from Adobe Acrobat to 
Postscript and run that Postscript through Acrobat Distiller – I get a fully 
searchable PDF.



Now… whether pdftops will output them, I don’t know. And whether 
Ghostscript, upon encountering them, will correctly restore the font encoding, 
again, I don’t know.



Leonard


_______________________________________________
poppler mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/poppler
