[ 
https://issues.apache.org/jira/browse/PDFBOX-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13126212#comment-13126212
 ] 

Antoni Mylka commented on PDFBOX-940:
-------------------------------------

I stumbled upon the same problem, on a confidential file. In the process I 
think I found an issue: PDFBOX-1137.

I'm not a PDF expert, but in that file, I have the following PDF objects:

24 0 obj
<</Type/Font/Subtype/Type0/BaseFont/TT491A9C96tCID
/Encoding 18 0 R/DescendantFonts[22 0 R]>>
endobj

22 0 obj
<</Subtype/CIDFontType2/CIDSystemInfo 23 0 R
/BaseFont/XJXBKC+TT491A9C96tCID/Type/Font/Name/R22
/FontDescriptor 21 0 R/DW 1000
/W[691[259]
724[677
626
626]
737[677]]/CIDToGIDMap/Identity
>>
endobj

18 0 obj
<</Type/CMap/Name/R18/WMode 0/CMapName/WinCharSetFFFF-H/CIDSystemInfo<<
/Registry(Adobe)
/Ordering(WinCharSetFFFF)
/Supplement 0
>>
/Filter/FlateDecode/Length 19 0 R>>stream
endstream
endobj

So there is an embedded CMAP for WinCharSetFFFF-H, a parent font which refers 
to the embedded CMAP as its encoding, and a child font with no encoding. 
Applying the PDFBOX-1137 patch allowed the CMAP to be parsed. 

Then, in the PDType0Font constructor, just after the descendant font is 
constructed, I added an if that makes the descendant "inherit" the cmap from 
the parent font when it has none of its own. This fixed the NPEs during text 
extraction, which happened because the cmap was missing:

descendentFont = PDFontFactory.createFont( descendantFontDictionary );
if (descendentFont.cmap == null) {
  descendentFont.cmap = this.cmap;
}
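
To show the fallback in isolation, here is a minimal, self-contained sketch 
that does not use PDFBox at all. Every name in it (CMapFallbackSketch, 
SketchCMap, SketchFont, SketchType0Font, decode) is a hypothetical stand-in 
invented for illustration, not a PDFBox class. It only demonstrates the null 
check and the hand-over of the parent's cmap; it assumes nothing about 
whether this inheritance is correct for real PDF fonts.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins only; none of these types exist in PDFBox.
final class CMapFallbackSketch {

    /** Simplified cmap: maps a character code to a decoded string. */
    static final class SketchCMap {
        private final Map<Integer, String> table = new HashMap<>();

        void put(int code, String value) {
            table.put(code, value);
        }

        String lookup(int code) {
            return table.get(code);
        }
    }

    /** Simplified font whose cmap may be missing (null). */
    static class SketchFont {
        SketchCMap cmap;
    }

    /** Simplified parent font that owns one descendant font. */
    static final class SketchType0Font extends SketchFont {
        final SketchFont descendant;

        SketchType0Font(SketchCMap parentCMap, SketchFont descendant) {
            this.cmap = parentCMap;
            this.descendant = descendant;
            // Same idea as the patch above: if the descendant has no cmap
            // of its own, let it reuse the parent's.
            if (descendant.cmap == null) {
                descendant.cmap = this.cmap;
            }
        }
    }

    /** Returns null instead of throwing an NPE when the font has no cmap. */
    static String decode(SketchFont font, int code) {
        return font.cmap == null ? null : font.cmap.lookup(code);
    }

    public static void main(String[] args) {
        SketchCMap parentCMap = new SketchCMap();
        parentCMap.put(0x41, "A");

        SketchFont child = new SketchFont(); // no cmap of its own
        new SketchType0Font(parentCMap, child);

        System.out.println(decode(child, 0x41));
    }
}
```

A descendant that already has its own cmap is left untouched, so the 
fallback only applies in the situation described above.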

I don't even know if this makes sense. Is the descendant font supposed to 
"inherit" the encoding from the parent font? This "fixed" the visible errors, 
but the output I get is still garbled. It's supposed to be a text in 
traditional Chinese. Can anyone with more PDF knowledge take a look at this?
                
> [pdmodel.font.PDFont] Error: Could not parse predefined CMAP  file for 
> 'PDFXC-Indentity0-0'
> -------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-940
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-940
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 1.4.0
>         Environment: Tomcat 6.0.18, windows server 2003, pdfbox-1.4.0.jar
>            Reporter: krishna
>         Attachments: gen_preview1.png, oob_pdf.pdf, pdf fonts.JPG, pdf 
> fonts1.JPG, pdf fonts2.JPG, pdf properties1.JPG, pdf properties2.JPG, pdf 
> properties3.JPG
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Hi,
>    when i am trying to upload a pdf document the following error is thrown in 
> the tomcat.. i am using pdfbox-1.4.0.jar..
> 17:29:33,465  ERROR [pdmodel.font.PDFont] Error: Could not parse predefined 
> CMAP  file for 'PDFXC-Indentity0-0'
> please find the solution
