Text extraction fails due to font problem with Type0, supplement-0 font
-----------------------------------------------------------------------
Key: PDFBOX-725
URL: https://issues.apache.org/jira/browse/PDFBOX-725
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.2.0
Environment: Fedora 11 or Windows
Reporter: Peter Costello
Fix For: 1.2.0
Text extraction fails. In particular, download and view page 3 (1-based) of:
http://www.encana.com/investors/financial/annualreports/2008/pdfs/annual-report-2008.pdf
With pdfbox text extraction, most of the page is displayed as "?". Other pages
in the file have similar problems.
To reproduce, set a breakpoint at line 376 of
org.apache.pdfbox.util.PDFStreamEngine.java (i.e. at "String c =
font.encode( string, i, codeLength );").
Make the breakpoint conditional on "string.length==52"; this is the second
occurrence of the problem.
Text extraction yields multiple "?" because the font encoding is not found.
The characters to be extracted are normal western characters.
The font COSDictionary contains:
COSName{Subtype}=COSName{Type0}
COSName{DescendantFonts}=COSArray{[COSObject{554, 0}]}
COSName{BaseFont}=COSName{HelveticaNeueLTStd-Lt-Identity-H}
COSName{Encoding}=COSName{Identity-H}
COSName{Type}=COSName{Font}
The "font.descendentFont" has the following COSDictionary items:
COSName{Subtype}=COSName{CIDFontType0}
COSName{FontDescriptor}=COSObject{540, 0}
COSName{BaseFont}=COSName{ALJOHE+HelveticaNeueLTStd-Lt}
COSName{W}=...
COSName{CIDSystemInfo}=COSDictionary{(COSName{Supplement}:COSInt{0})
(COSName{Ordering}:COSString{Identity})
(COSName{Registry}:COSString{Adobe}) }
etc.
All the "CMap" lookups return null.
Code added to PDFont in February 2010 for Japanese characters tries
"Adobe-Identity-UCS2", but that does not work. I think that code fragment
should also try "Adobe-Identity-0", or should not run at all if the dictionary
contains "Supplement=0". I manually tried "Adobe-Identity-0", but no CMap with
that name exists.
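The lookup order suggested above could be sketched as follows. This is a
hypothetical helper, not the actual PDFont code; the class and method names are
made up for illustration, and the names it builds come only from the Registry,
Ordering and Supplement values in CIDSystemInfo. Neither name resolved to a
CMap for this file, so this only illustrates the proposed fallback order.

```java
import java.util.ArrayList;
import java.util.List;

class CMapNameSketch {
    /**
     * Hypothetical helper: list the CMap names to try, in order, for a
     * given CIDSystemInfo. The "-UCS2" name is the one the existing code
     * is described as trying; the "-<supplement>" name is the proposed
     * additional candidate.
     */
    static List<String> candidateCMapNames(String registry, String ordering,
                                           int supplement) {
        List<String> names = new ArrayList<String>();
        String prefix = registry + "-" + ordering + "-";
        names.add(prefix + "UCS2");
        String withSupplement = prefix + supplement;
        if (!names.contains(withSupplement)) {
            names.add(withSupplement);
        }
        return names;
    }

    public static void main(String[] args) {
        // Values taken from the CIDSystemInfo dictionary in this report.
        System.out.println(candidateCMapNames("Adobe", "Identity", 0));
    }
}
```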
The descendentFont has an encoding that corresponds to Supplement-0, but
modifying PDFont to use the descendentFont encoding is not sufficient.
In particular, "descendentFont.getEncoding()" looks promising as a source for
the encoding.
What is very odd is that the second "string" displays in Acrobat Reader as
"Portfolio of ....", but the byte[] contains:
[0, 49, 0, 80, 0, 83, 0, 85, 0, 71, 0, 80, 0, 77, 0, 74, 0, 80, 0, 1, 0, 80, 0,
71, 0, 1, 0, 70, 0, 84, 0, 85, 0, 66, 0, 67, 0, 77, 0, 74, 0, 84, 0, 73, 0, 70,
0, 69, 0, 1, 0, 83]
These are not the corresponding ASCII characters, so it seems that these
values are either indexes into a font table, or the byte[] was loaded with
incorrect values.
I have confirmed that each pair of bytes should be converted into a single char.
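The byte[] above can be checked by hand. The sketch below is a standalone
illustration, not PDFBox code; the class and method names are hypothetical. It
pairs the bytes into 2-byte codes (high byte first) and then adds a constant
offset of 31 to each code. The offset is only an observation from this one
string (49 + 31 = 80 = 'P', 1 + 31 = 32 = space); nothing guarantees it in
general, and a real fix would have to use the font's own mapping data.

```java
import java.util.Arrays;

class CidDecodeSketch {
    // The 52 bytes reported in this issue.
    static final byte[] REPORTED = {
        0, 49, 0, 80, 0, 83, 0, 85, 0, 71, 0, 80, 0, 77, 0, 74, 0, 80,
        0, 1, 0, 80, 0, 71, 0, 1, 0, 70, 0, 84, 0, 85, 0, 66, 0, 67,
        0, 77, 0, 74, 0, 84, 0, 73, 0, 70, 0, 69, 0, 1, 0, 83
    };

    /** Combine each pair of bytes (high byte first) into one integer code. */
    static int[] toCodes(byte[] raw) {
        if (raw.length % 2 != 0) {
            throw new IllegalArgumentException("expected an even number of bytes");
        }
        int[] codes = new int[raw.length / 2];
        for (int i = 0; i < codes.length; i++) {
            codes[i] = ((raw[2 * i] & 0xFF) << 8) | (raw[2 * i + 1] & 0xFF);
        }
        return codes;
    }

    /** Add a fixed offset to every code and treat the result as a char. */
    static String shiftToText(int[] codes, int offset) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append((char) (code + offset));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codes = toCodes(REPORTED);
        System.out.println(Arrays.toString(codes));
        // Offset 31 is an assumption derived from this sample only.
        System.out.println(shiftToText(codes, 31));
    }
}
```

With an offset of 31 the second line printed is "Portfolio of established r",
which agrees with the "Portfolio of ...." text shown by Acrobat Reader. That
suggests the values are indexes into a font table rather than incorrectly
loaded bytes.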