Text extraction fails due to font problem with Type0, supplement-0 font
-----------------------------------------------------------------------

                 Key: PDFBOX-725
                 URL: https://issues.apache.org/jira/browse/PDFBOX-725
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.2.0
         Environment: Fedora 11 or Windows
            Reporter: Peter Costello
             Fix For: 1.2.0


Text extraction fails. In particular, download and view page 3 (1-based) of:
http://www.encana.com/investors/financial/annualreports/2008/pdfs/annual-report-2008.pdf

With PDFBox text extraction, most of the page is displayed as "?". Other pages 
in the file have similar problems.
To reproduce, break at line 376 in org.apache.util.PDFStreamEngine.java (i.e. at 
"String c = font.encode( string, i, codeLength );").
Break conditionally when "string.length==52"; this is the second occurrence of 
the problem.

Text extraction yields multiple "?" because the font encoding is not found.
The characters to be extracted are normal western characters.

The font COSDictionary contains:
COSName{Subtype}=COSName{Type0}
COSName{DescendantFonts}=COSArray{[COSObject{554, 0}]}
COSName{BaseFont}=COSName{HelveticaNeueLTStd-Lt-Identity-H}
COSName{Encoding}=COSName{Identity-H}
COSName{Type}=COSName{Font}

The "font.descendentFont" has the following COSDictionary items:
COSName{Subtype}=COSName{CIDFontType0}
COSName{FontDescriptor}=COSObject{540, 0}
COSName{BaseFont}=COSName{ALJOHE+HelveticaNeueLTStd-Lt}
COSName{W}=...
COSName{CIDSystemInfo}=COSDictionary{(COSName{Supplement}:COSInt{0}) 
      (COSName{Ordering}:COSString{Identity})
      (COSName{Registry}:COSString{Adobe}) }
etc.

All the "CMap" lookups return null.
Code added to PDFont in February 2010 for Japanese characters tries 
"Adobe-Identity-UCS2", but that does not work. I think that code fragment 
should also try "Adobe-Identity-0", or should not run at all if the dictionary 
contains "Supplement=0". I manually tried "Adobe-Identity-0", but it does not 
exist.
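
As an illustration of the lookup names discussed above, here is a minimal 
sketch of how the two candidate CMap names can be derived from the 
CIDSystemInfo entries. The class and method names are hypothetical and this 
is not the actual PDFont code; it only shows where "Adobe-Identity-0" and 
"Adobe-Identity-UCS2" come from. For this font neither name resolved to a 
CMap resource.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
final class CMapNames {

    /**
     * Builds the candidate CMap names for a CIDSystemInfo dictionary:
     * first "Registry-Ordering-Supplement", then "Registry-Ordering-UCS2".
     */
    static List<String> candidates(String registry, String ordering, int supplement) {
        String prefix = registry + "-" + ordering + "-";
        List<String> names = new ArrayList<>();
        names.add(prefix + supplement); // e.g. Adobe-Identity-0
        names.add(prefix + "UCS2");     // e.g. Adobe-Identity-UCS2
        return names;
    }

    public static void main(String[] args) {
        // Values taken from the CIDSystemInfo dump in this report.
        for (String name : candidates("Adobe", "Identity", 0)) {
            System.out.println(name);
        }
    }
}
```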

The descendentFont has an encoding that corresponds to Supplement-0, and 
"descendentFont.getEncoding()" looks promising as a source for the encoding, 
but modifying PDFont to use the descendentFont encoding is not sufficient on 
its own.

What is very odd is that the second "string" displays in Acrobat Reader as 
"Portfolio of ....", but the byte[] contains:
[0, 49, 0, 80, 0, 83, 0, 85, 0, 71, 0, 80, 0, 77, 0, 74, 0, 80, 0, 1, 0, 80, 0, 
71, 0, 1, 0, 70, 0, 84, 0, 85, 0, 66, 0, 67, 0, 77, 0, 74, 0, 84, 0, 73, 0, 70, 
0, 69, 0, 1, 0, 83]
These are not the corresponding ASCII character codes, so it seems that these 
values are either indexes into a font table, or the byte[] was loaded with 
incorrect values.
I have confirmed that each 2-byte pair should be converted into a single char.
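
To make the byte[] easier to reason about, the following sketch groups it 
into big-endian 2-byte codes and then shifts each code by a fixed offset. 
The class and method names are hypothetical. The offset of 31 is only an 
observation about this one sample (code 1 lines up with the space character, 
code 49 with "P"); it is not a general decoding rule and not a proposed fix, 
but it supports the idea that the values are indexes into the font rather 
than corrupted data. With that offset the 26 codes read 
"Portfolio of established r".

```java
import java.util.Arrays;

// Hypothetical diagnostic helper, not part of PDFBox.
final class CidCodes {

    // The 52-byte operand quoted in this report.
    static final byte[] SAMPLE = {
        0, 49, 0, 80, 0, 83, 0, 85, 0, 71, 0, 80, 0, 77, 0, 74, 0, 80, 0, 1,
        0, 80, 0, 71, 0, 1, 0, 70, 0, 84, 0, 85, 0, 66, 0, 67, 0, 77, 0, 74,
        0, 84, 0, 73, 0, 70, 0, 69, 0, 1, 0, 83
    };

    /** Groups the raw bytes into big-endian 2-byte codes. */
    static int[] toCodes(byte[] raw) {
        if (raw.length % 2 != 0) {
            throw new IllegalArgumentException("expected an even number of bytes");
        }
        int[] codes = new int[raw.length / 2];
        for (int i = 0; i < codes.length; i++) {
            codes[i] = ((raw[2 * i] & 0xFF) << 8) | (raw[2 * i + 1] & 0xFF);
        }
        return codes;
    }

    /** Adds a fixed offset to every code and returns the resulting chars. */
    static String shift(int[] codes, int offset) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append((char) (code + offset));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codes = toCodes(SAMPLE);
        System.out.println(codes.length + " codes: " + Arrays.toString(codes));
        // Offset 31 is an assumption derived from this sample only.
        System.out.println(shift(codes, 31));
    }
}
```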

