[ https://issues.apache.org/jira/browse/PDFBOX-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13126212#comment-13126212 ]
Antoni Mylka commented on PDFBOX-940:
-------------------------------------

I stumbled upon the same problem, on a confidential file. In the process I think I found an issue: PDFBOX-1137.

I'm not a PDF expert, but in that file, I have the following PDF objects:

    24 0 obj
    <</Type/Font/Subtype/Type0/BaseFont/TT491A9C96tCID/Encoding 18 0 R/DescendantFonts[22 0 R]>>
    endobj

    22 0 obj
    <</Subtype/CIDFontType2/CIDSystemInfo 23 0 R/BaseFont/XJXBKC+TT491A9C96tCID/Type/Font/Name/R22/FontDescriptor 21 0 R/DW 1000 /W[691[259] 724[677 626 626] 737[677]]/CIDToGIDMap/Identity >>
    endobj

    18 0 obj
    <</Type/CMap/Name/R18/WMode 0/CMapName/WinCharSetFFFF-H/CIDSystemInfo<< /Registry(Adobe) /Ordering(WinCharSetFFFF) /Supplement 0 >> /Filter/FlateDecode/Length 19 0 R>>stream
    endstream
    endobj

So there is an embedded CMAP for WinCharSetFFFF-H, a parent font (24 0) which refers to the embedded CMAP as its encoding, and a child font (22 0) with no encoding.

Applying the PDFBOX-1137 patch allowed the CMAP to be parsed. Then, in the PDType0Font constructor, I added an if statement just after the descendant font is constructed, so that the descendant font "inherits" the cmap from the parent font. This fixed the NPEs during text extraction, which happened because the cmap was missing:

    descendentFont = PDFontFactory.createFont( descendantFontDictionary );
    if (descendentFont.cmap == null)
    {
        descendentFont.cmap = this.cmap;
    }

I don't even know if this makes sense. Is the descendant font supposed to "inherit" the encoding from the parent font?

This "fixed" the visible errors, but the output I get is still garbled. It's supposed to be text in traditional Chinese. Can anyone with more PDF knowledge take a look at this?
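For anyone who wants to try the fallback outside of PDFBox, here is a minimal standalone sketch of the same idea. It does not use the PDFBox API: SimpleCMap, SimpleFont, CompositeFont and toCid are hypothetical stand-ins invented for this illustration, and the code-to-CID values are made up. It only shows the null-cmap fallback and a lookup that returns null instead of throwing. Note that, as far as I understand the PDF font model, the Encoding CMap of a Type0 font maps character codes to CIDs rather than to Unicode, so the sketch models the lookup that way; that may also be related to why the extracted text stays garbled even after the NPEs are gone.

```java
import java.util.HashMap;
import java.util.Map;

/** Standalone illustration only; these are NOT the PDFBox classes. */
public class CMapInheritanceSketch {

    /** Hypothetical minimal CMap: character code -> CID. */
    static final class SimpleCMap {
        private final Map<Integer, Integer> codeToCid = new HashMap<>();

        void add(int code, int cid) {
            codeToCid.put(code, cid);
        }

        Integer lookup(int code) {
            return codeToCid.get(code); // null when the code is unmapped
        }
    }

    /** Hypothetical font whose cmap may be missing, like the child font above. */
    static class SimpleFont {
        SimpleCMap cmap; // null models "no encoding on this font"

        /** Returns the CID for a code, or null if there is no cmap or no mapping. */
        Integer toCid(int code) {
            if (cmap == null) {
                return null; // avoid the NullPointerException instead of dereferencing
            }
            return cmap.lookup(code);
        }
    }

    /** Hypothetical parent (Type0-like) font holding one descendant font. */
    static final class CompositeFont extends SimpleFont {
        final SimpleFont descendant;

        CompositeFont(SimpleCMap parentCMap, SimpleFont descendant) {
            this.cmap = parentCMap;
            this.descendant = descendant;
            // The fallback from the comment: only fill in a missing cmap,
            // never overwrite one the descendant already has.
            if (descendant.cmap == null) {
                descendant.cmap = this.cmap;
            }
        }
    }

    public static void main(String[] args) {
        SimpleCMap embedded = new SimpleCMap();
        embedded.add(0x41, 691); // made-up mapping, for illustration only

        SimpleFont child = new SimpleFont(); // starts with cmap == null
        new CompositeFont(embedded, child);

        System.out.println("mapped code:   " + child.toCid(0x41));
        System.out.println("unmapped code: " + child.toCid(0x42));
    }
}
```

The constructor deliberately leaves an existing descendant cmap alone, which mirrors the `if (descendentFont.cmap == null)` guard in the patch. Whether such inheritance is the right behaviour for real PDF fonts is exactly the open question raised above.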

> [pdmodel.font.PDFont] Error: Could not parse predefined CMAP file for 'PDFXC-Indentity0-0'
> -------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-940
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-940
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 1.4.0
>         Environment: Tomcat 6.0.18, windows server 2003, pdfbox-1.4.0.jar
>            Reporter: krishna
>         Attachments: gen_preview1.png, oob_pdf.pdf, pdf fonts.JPG, pdf fonts1.JPG, pdf fonts2.JPG, pdf properties1.JPG, pdf properties2.JPG, pdf properties3.JPG
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Hi,
> when i am trying to upload a pdf document the following error is thrown in the tomcat.. i am using pdfbox-1.4.0.jar..
> 17:29:33,465 ERROR [pdmodel.font.PDFont] Error: Could not parse predefined CMAP file for 'PDFXC-Indentity0-0'
> please find the solution