[
https://issues.apache.org/jira/browse/PDFBOX-610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andreas Lehmkühler updated PDFBOX-610:
--------------------------------------
Fix Version/s: (was: 0.8.0-incubator)
> Fonts should not be cached by PDFStreamEngine
> ---------------------------------------------
>
> Key: PDFBOX-610
> URL: https://issues.apache.org/jira/browse/PDFBOX-610
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 0.8.0-incubator
> Environment: Win or Linux
> Reporter: Peter Costello
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> In org.apache.pdfbox.util.PDFStreamEngine, fonts are cached in the field
> 'private Map documentFontCache = new HashMap();',
> which is used in method 'processSubStream()' by the call
> 'sr.fonts = resources.getFonts(documentFontCache);'.
> The problem is that a PDF document can declare a font with a limited
> 'firstChar' to 'lastChar' range (maybe just a space char), and then expand
> that range at a later point within the same page. When the font is cached,
> those updates are ignored.
> In particular, test page 1 of
> 'http://www.encana.com/investor/financial/shareholder/pdfs/info-circular-french.pdf'.
> With font caching, the widths of the characters in the upper right corner of
> the page are reported as zero, and both text extraction and text merging are
> compromised.
> Without font caching, the widths are correct. There are other examples that
> cause the same problem.
> To fix the problem, change the call in method 'processSubStream()' to:
> sr.fonts = resources.getFonts(null);
> Some effort was put into font caching. Unfortunately, it should not be used
> on unknown documents, because they may redefine a font in this way.
>
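The stale-cache behaviour described above can be reproduced in isolation. The
following is a minimal sketch, not PDFBox code: FontCacheSketch, FontDict, Font
and getFont are hypothetical stand-ins, and keying the cache by font name is an
assumption made for illustration. It only shows why a font object built from an
early, narrow 'firstChar'/'lastChar' range reports zero widths once the
document later widens that range.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are not PDFBox classes.
class FontCacheSketch {

    /** A font dictionary as declared at one point in the document. */
    static final class FontDict {
        final String name;
        final int firstChar;
        final int lastChar;
        final float[] widths; // one entry per code in [firstChar, lastChar]

        FontDict(String name, int firstChar, int lastChar, float[] widths) {
            this.name = name;
            this.firstChar = firstChar;
            this.lastChar = lastChar;
            this.widths = widths;
        }
    }

    /** A font object built once from a dictionary. */
    static final class Font {
        private final FontDict dict;

        Font(FontDict dict) {
            this.dict = dict;
        }

        float widthOf(int code) {
            if (code < dict.firstChar || code > dict.lastChar) {
                return 0f; // outside the declared range: no width known
            }
            return dict.widths[code - dict.firstChar];
        }
    }

    /** Reuses a cached font when a cache is supplied (assumed keyed by name). */
    static Font getFont(FontDict dict, Map<String, Font> cache) {
        if (cache == null) {
            return new Font(dict); // no cache: always reflects the current dictionary
        }
        return cache.computeIfAbsent(dict.name, k -> new Font(dict));
    }

    /** Looks up font "F1" twice, narrow range first, then returns the width of 'A'. */
    static float widthOfAAfterUpdate(boolean useCache) {
        Map<String, Font> cache = useCache ? new HashMap<String, Font>() : null;

        // First declaration: only the space character (code 32).
        FontDict early = new FontDict("F1", 32, 32, new float[] {250f});

        // Later declaration of the same font: range widened to 32..65.
        float[] wide = new float[65 - 32 + 1];
        Arrays.fill(wide, 600f);
        wide[0] = 250f;
        FontDict later = new FontDict("F1", 32, 65, wide);

        getFont(early, cache);               // populates the cache, if any
        Font current = getFont(later, cache); // cached copy ignores the update
        return current.widthOf('A');
    }

    public static void main(String[] args) {
        System.out.println("cached:   " + widthOfAAfterUpdate(true));  // 0.0
        System.out.println("uncached: " + widthOfAAfterUpdate(false)); // 600.0
    }
}
```

Passing null for the cache in this sketch corresponds to the proposed
'resources.getFonts(null)' change: each lookup rebuilds the font from the
current dictionary, at the cost of repeated construction.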