Thanks for the information.

>provided that you are following the rules for text extraction in ISO 32000
>(PDF standard) vs. doing something like “grep”.

I am using poppler (directly with pdftotext and indirectly with atril and okular).
The original PDF has sequences like the one below for each font. Those Unicode CMaps can't be passed in postscript, so do I permanently lose useful text extraction when I convert this PDF to postscript with pdftops?

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo <</Registry (Berkeley-Black-OV-BHQHLP) /Ordering (CIDUCS) /Supplement 0 >> def
/CMapName /Berkeley-Black-OV-BHQHLP def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfchar
<0001> <0066>
<0002> <006F>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end

William

________________________________
From: Leonard Rosenthol <[email protected]>
Sent: Wednesday, July 1, 2020 10:43 AM
To: William Bader <[email protected]>; [email protected] <[email protected]>
Subject: Re: [poppler] pdftops font subset question

Subsetting of a font has *ZERO* impact on the ability to extract text from a PDF…provided that you are following the rules for text extraction in ISO 32000 (PDF standard) vs. doing something like “grep”.

Leonard

From: poppler <[email protected]> on behalf of William Bader <[email protected]>
Date: Wednesday, July 1, 2020 at 2:18 AM
To: "[email protected]" <[email protected]>
Subject: [poppler] pdftops font subset question

Is there any way to prevent pdftops from subsetting fonts?
I want to be able to convert the ps back to a PDF and still be able to extract text with pdftotext.

I have a large single page PDF. When I drag to copy text in atril or okular or run pdftotext, it finds the text.

pdffonts shows about 40 fonts. They are all similar:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
HelveticaNeueLTStd-Roman--Identity-H CID Type 0C       Identity-H       yes no  yes    214  0
HelveticaNeueLTStd-BdIt--Identity-H  CID Type 0C       Identity-H       yes no  yes    236  0
...
HelveticaLTStd-Bold--Identity-H      CID Type 0C       Identity-H       yes no  yes     70  0
Berkeley-Bold--Identity-H            CID Type 0C       Identity-H       yes no  yes     60  0

pdfinfo shows

ModDate:        Fri Jun 26 21:27:37 2020 WEST
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          1
Encrypted:      no
Page size:      702 x 1296 pts
Page rot:       0
File size:      13501736 bytes
Optimized:      no
PDF version:    1.6

When I run the PDF through pdftops, it subsets the fonts, and then when I convert it back into a PDF with ghostscript ps2pdf, the text shows, but copying it or running pdftotext does not work.

The end of the generated ps is

%%+ font BHQHNF+MinionPro-Regular
%%+ font BHQHNG+Berkeley-Book
%%+ font BHQHNH+HelveticaLTStd-Bold
%%+ font BHQHNI+Berkeley-Bold
%%EOF

so it looks like pdftops is subsetting the fonts.

"grep Berkeley-Bold", for example, shows

%%BeginResource: font BHQHNI+Berkeley-Bold
/CIDFontName /BHQHNI+Berkeley-Bold def
/F60_0 /BHQHNI+Berkeley-Bold 0 pdfMakeFont16L3
%%+ font BHQHNI+Berkeley-Bold

"grep -A 1 ' Tc$' x.ps | grep '(' | head" also appears to show that the fonts have been subsetted.

(\000\025\000\014)
(\000\015\000\024)
(\000\001\000*)
(\000\002\000\003\000\012)
(\000\006\000\015)
(\000\014\000\017\000\005\000\007)
(\000\033\000\031)
(\000\013\000"\000"\000\026\000\022)
(\000\012\000\004)
(\000\024\000\023\000\017\000\001)

In testing, I also noticed that some pdftops options like -level3 generate ps files that crash ghostscript, but for now I think that is a ghostscript issue.
https://bugs.ghostscript.com/show_bug.cgi?id=702526
The ghostscript bug report has a copy of the PDF.

I can post this as a poppler bug report, but I wanted to check first that I didn't miss a pdftops option or that there wasn't an internal flag that I could expose as an option in pdftops.

William
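
To make concrete what the ToUnicode CMap contributes, here is a minimal Python sketch of how the bfchar entries quoted earlier in this thread turn the 2-byte codes in a PostScript show string into text. The function names (parse_bfchar, ps_string_bytes, decode_show_string) are hypothetical and not part of poppler or ghostscript. It assumes ASCII input, handles only bfchar (not bfrange), and ignores the backslash-newline continuation. Pairing the Berkeley-Black CMap with a string taken from the grep output is purely illustrative: the thread does not say which font that string uses, nor whether pdftops keeps the original codes.

```python
import re

# The bfchar section from the CMap quoted in this thread.
CMAP = """
2 beginbfchar
<0001> <0066>
<0002> <006F>
endbfchar
"""

# Single-character escapes allowed in PostScript literal strings.
_ESCAPES = {"n": 10, "r": 13, "t": 9, "b": 8, "f": 12,
            "\\": 92, "(": 40, ")": 41}


def parse_bfchar(cmap_text):
    """Return {code: unicode_text} from every beginbfchar/endbfchar section."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # The destination is UTF-16BE, as in PDF ToUnicode CMaps.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def ps_string_bytes(body):
    """Decode the body of a PostScript (...) string, without the parentheses."""
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\" or i + 1 == len(body):
            out.append(ord(ch))          # ordinary character (ASCII assumed)
            i += 1
            continue
        octal = re.match(r"[0-7]{1,3}", body[i + 1:])
        if octal:
            out.append(int(octal.group(), 8) & 0xFF)   # \ddd octal escape
            i += 1 + len(octal.group())
        else:
            nxt = body[i + 1]
            out.append(_ESCAPES.get(nxt, ord(nxt)))    # \n, \(, \\ and so on
            i += 2
    return bytes(out)


def decode_show_string(body, mapping):
    """Split into 16-bit codes and map each through the bfchar table."""
    raw = ps_string_bytes(body)
    codes = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
    # Codes with no entry become U+FFFD, which is roughly what a text
    # extractor is left with once the mapping is gone.
    return codes, "".join(mapping.get(code, "\ufffd") for code in codes)


if __name__ == "__main__":
    table = parse_bfchar(CMAP)
    print(decode_show_string(r"\000\001\000*", table))
```

With only the two entries quoted above, code 0001 maps to "f" and code 0002 to "o"; every other code comes back as U+FFFD. That mirrors the situation in the round-tripped file: the codes survive in the ps, but without the code-to-Unicode table there is nothing to turn them back into text.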
_______________________________________________
poppler mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/poppler
