Fonts should not be cached by PDFStreamEngine
---------------------------------------------
Key: PDFBOX-610
URL: https://issues.apache.org/jira/browse/PDFBOX-610
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 0.8.0-incubator
Environment: Win or Linux
Reporter: Peter Costello
Fix For: 0.8.0-incubator
org.apache.pdfbox.util.PDFStreamEngine
Fonts are cached in the field 'private Map documentFontCache = new HashMap();',
which is passed to 'resources.getFonts()' in method 'processSubStream()':
sr.fonts = resources.getFonts(documentFontCache);
The problem is that a PDF document can store a font with a limited
'firstChar'..'lastChar' range (maybe just the space char) and then expand that
range at a later point within the same page. When the font is cached, those
updates are ignored.
For a test case, see page 1 of
http://www.encana.com/investor/financial/shareholder/pdfs/info-circular-french.pdf
With font caching, the widths of the characters in the upper right corner of
the page are reported as zero, so text extraction and text merging are
compromised.
Without font caching, the widths are correct. There are other examples that
cause the same problem.
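
The stale-cache behaviour described above can be illustrated with a small
stand-alone sketch. It does not use the PDFBox API: 'FontDict', 'getFont()' and
'widthOfA()' are hypothetical names, the widths are made-up values, and keying
the cache by font resource name is an assumption made only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class FontCacheDemo {

    /** Hypothetical stand-in for a PDF font entry: first char, last char, widths. */
    static final class FontDict {
        final int firstChar;
        final int lastChar;
        final float[] widths;

        FontDict(int firstChar, int lastChar, float[] widths) {
            this.firstChar = firstChar;
            this.lastChar = lastChar;
            this.widths = widths;
        }

        /** Width of a character code, or 0 if the code is outside the stored range. */
        float widthOf(int code) {
            if (code < firstChar || code > lastChar) {
                return 0f;
            }
            return widths[code - firstChar];
        }
    }

    /** Looks up a font by name; a non-null cache returns the first entry ever seen. */
    static FontDict getFont(String name, Map<String, FontDict> resources,
                            Map<String, FontDict> cache) {
        if (cache == null) {
            return resources.get(name); // no caching: always use the current resources
        }
        return cache.computeIfAbsent(name, resources::get);
    }

    /** Returns the width reported for 'A' after the font range has been expanded. */
    static float widthOfA(boolean useCache) {
        Map<String, FontDict> cache = useCache ? new HashMap<>() : null;

        // First sub-stream: font F1 covers only the space character (code 32).
        Map<String, FontDict> first = new HashMap<>();
        first.put("F1", new FontDict(32, 32, new float[] {250f}));
        getFont("F1", first, cache);

        // Later sub-stream on the same page: F1 now covers codes 32..65.
        float[] widths = new float[65 - 32 + 1];
        widths[0] = 250f;        // space
        widths['A' - 32] = 722f; // 'A' (illustrative width)
        Map<String, FontDict> later = new HashMap<>();
        later.put("F1", new FontDict(32, 65, widths));

        return getFont("F1", later, cache).widthOf('A');
    }

    public static void main(String[] args) {
        System.out.println("cached:   " + widthOfA(true));
        System.out.println("uncached: " + widthOfA(false));
    }
}
```

With the cache enabled, the lookup keeps returning the first, narrow entry, so
the width of 'A' comes back as zero. Passing null for the cache picks up the
expanded range.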
To fix the problem, change the call in method 'processSubStream()' to:
sr.fonts = resources.getFonts(null);
Some effort was put into font caching. Unfortunately, it should not be used on
documents whose contents are not known in advance.