[jira] Commented: (PDFBOX-725) Text extraction fails due to font problem with Type0, supplement-0 font

JIRA Thu, 24 Jun 2010 00:43:18 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882080#action_12882080
 ]


Andreas Lehmkühler commented on PDFBOX-725:
-------------------------------------------

My updates are related to the issue, but they didn't solve it. I'm still 
investigating. The problem is the CID encoding.

> Text extraction fails due to font problem with Type0, supplement-0 font
> -----------------------------------------------------------------------
>
>                 Key: PDFBOX-725
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-725
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.2.0
>         Environment: Fedora 11 or windows
>            Reporter: Peter Costello
>         Attachments: annual-report-2008_pg23.pdf
>
>
> Text extraction fails. In particular, download and view pg23 or others 
> (1-based) of:
> http://www.encana.com/investors/financial/annualreports/2008/pdfs/annual-report-2008.pdf
> With pdfbox text extraction, last 5 lines of page are displayed as "?". Other 
> pages in the file have similar problems.
> Text extraction yields multiple "?" because "font.encode(buf,i,2)" returns 
> null.
> The font COSDictionary contains:
> COSName{Subtype}=COSName{Type0}
> COSName{DescendantFonts}=COSArray{[COSObject{554, 0}]}
> COSName{BaseFont}=COSName{HelveticaNeueLTStd-Lt-Identity-H}
> COSName{Encoding}=COSName{Identity-H}
> COSName{Type}=COSName{Font}
> The "font.descendentFont" has the following COSDictionary items:
> COSName{Subtype}=COSName{CIDFontType0}
> COSName{FontDescriptor}=COSObject{540, 0}
> COSName{BaseFont}=COSName{ALJOHE+HelveticaNeueLTStd-Lt}
> COSName{W}=...
> COSName{CIDSystemInfo}=COSDictionary{(COSName{Supplement}:COSInt{0}) 
> (COSName{Ordering}:COSString{Identity},(COSName{Registry}:COSString{Adobe}) }
> COSName{DW}=COSInt{1000}
> COSName{Type}=COSName{Font}
> The "fontDescriptor" of the descendentFont is:
> {COSName{StemV}=COSInt{58}, 
> COSName{FontName}=COSName{ALJOHE+HelveticaNeueLTStd-Lt}, 
> COSName{FontFile3}=COSObject{543, 0}, 
> COSName{CIDSet}=COSObject{545, 0}, 
> COSName{Flags}=COSInt{6}, 
> COSName{Descent}=COSInt{-271}, 
> COSName{FontBBox}=COSArray{[COSInt{-166}, COSInt{-214}, COSInt{1050}, 
> COSInt{967}]}, COSName{Ascent}=COSInt{752}, 
> COSName{CapHeight}=COSInt{737}, 
> COSName{XHeight}=COSInt{553}, 
> COSName{Type}=COSName{FontDescriptor}, 
> COSName{ItalicAngle}=COSInt{0}, 
> COSName{StemH}=COSInt{45}}
> The last 5 lines on the page are:
> "Increased Cash Flow by 11 percent to $9,386 million;"
> "Increased Operating Earnings by ..."
> etc
> These 5 lines are encoded as 2 bytes per character (it is a Type0 font).
> Each 2-byte code is offset by 31 from its displayed value.
> For instance, code "0x00, 0x01" should convert to ASCII "0x0020" (a space).
> The font is an "Identity" font, which means codes should just map to latin 
> ISO chars.
> Yeah, this is a Type0 font which can display a subset of another font (the 
> latin ISO), 
> but how come codes differ from the ascii by +31?
> This same 31 offset is found on all other pages of the file using this font.
> The font descriptor for the descendentFont has "Flags=6". Bit 3 is "Symbolic".
> PDF Spec 5.7.1 "Font contains glyphs outside the Adobe standard Latin 
> character set."
> Maybe because the Font is "Symbolic" there is not a 1:1 map from codes to 
> ascii.
> The question is whether the PDF file specifies the +31 offset, and pdfbox 
> fails to properly account for this offset. I can't find any reference to 
> such an offset in the PDF spec. The 'getFirstChar()' in the descendentFont 
> returns -1, but the real value is "32". Maybe this +31 offset just equals 
> 'firstChar-1'?
> The real firstChar can be found via:
>   COSDictionary fontDict = (COSDictionary)font.getCOSObject();
>   COSArray descendantFontArray = (COSArray)fontDict.getDictionaryObject(COSName.DESCENDANT_FONTS);
>   if (descendantFontArray != null)  {
>     COSDictionary descendantFontDictionary = (COSDictionary)descendantFontArray.getObject(0);
>     PDFont descendentFont = PDFontFactory.createFont(descendantFontDictionary);
>     Encoding encoding = descendentFont.getEncoding();
>     Iterator keyIterator = codeMap.keySet().iterator();
>     int firstChar=Integer.MAX_VALUE;
>     while (keyIterator.hasNext()) {
>       firstChar = Math.min(firstChar,((Integer)keyIterator.next()).intValue());
>     }
>   }
> Other example on page 3 of the document:
> Text  "Portfolio of ...." displays in Acrobat Reader, but the byte[] contains:
> [0, 49, 0, 80, 0, 83, 0, 85, 0, 71, 0, 80, 0, 77, 0, 74, 0, 80, 0, 1, 0, 80, 
> 0, 71, 0, 1, 0, 70, 0, 84, 0, 85, 0, 66, 0, 67, 0, 77, 0, 74, 0, 84, 0, 73, 
> 0, 70, 0, 69, 0, 1, 0, 83]
> Again, if +31 is added to each of these 2-byte codes then the ASCII is found. 
> Where does this "+31" come from?  Acrobat Reader gets it right.  How about 
> pdfbox?
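
The +31 observation in the report can be checked with a minimal Java sketch. 
It only reproduces the reporter's arithmetic on the page-3 byte array quoted 
above. The class, field, and method names are hypothetical and not part of 
PDFBox, and the fixed offset is an empirical value for this one font, not 
something taken from the PDF spec.

```java
// Hypothetical helper for illustration only; not part of PDFBox.
class OffsetDecodeSketch {

    // Byte array quoted in the report for the page-3 text "Portfolio of ....".
    static final byte[] PAGE3_BYTES = {
        0, 49, 0, 80, 0, 83, 0, 85, 0, 71, 0, 80, 0, 77, 0, 74, 0, 80, 0, 1, 0, 80,
        0, 71, 0, 1, 0, 70, 0, 84, 0, 85, 0, 66, 0, 67, 0, 77, 0, 74, 0, 84, 0, 73,
        0, 70, 0, 69, 0, 1, 0, 83
    };

    // Reads big-endian 2-byte codes and adds a fixed offset to each one.
    // Assumption: the offset is constant for the font, as the report observes.
    static String decode(byte[] data, int offset) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 1 < data.length; i += 2) {
            int code = ((data[i] & 0xFF) << 8) | (data[i + 1] & 0xFF);
            sb.append((char) (code + offset));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints "Portfolio of established r"
        System.out.println(decode(PAGE3_BYTES, 31));
    }
}
```

The sketch shows only that the codes are consistent with a constant shift. It 
does not explain where the shift comes from, which is the open question in 
this issue.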

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
