[jira] [Resolved] (PDFBOX-610) Fonts should not be cached by PDFStreamEngine

Resolved Sun, 22 Jan 2012 05:27:10 -0800

     [ 
https://issues.apache.org/jira/browse/PDFBOX-610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Andreas Lehmkühler resolved PDFBOX-610.
---------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.7.0
         Assignee: Andreas Lehmkühler

I have to agree with Peter. To cache a font one has to ensure that fonts are 
100% equal, which is possible but complicated. It's not enough to just compare 
the name, the subtype and the encoding. I stumbled upon this issue when 
rendering the Centerplan.pdf attached to PDFBOX-615.

I removed the font caching in revision 1234506. I also improved, and hopefully 
simplified, the handling of a PDF's resources. On the one hand, these changes 
may have a negative impact on performance because of the missing font cache; 
on the other hand, all fonts are now handled correctly.
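
To illustrate why comparing only the name, the subtype and the encoding is not 
enough, here is a minimal, self-contained sketch. It is not the PDFBox code: 
FontDict, weakKey(), widthWithCache() and widthWithoutCache() are hypothetical 
names invented for this example, and the width values are made up. It only 
assumes that a font declares a firstChar/lastChar range with one width per 
code, and that a later definition of the "same" font can widen that range.

```java
import java.util.HashMap;
import java.util.Map;

public class FontCacheSketch {

    /** Hypothetical stand-in for a PDF font dictionary; not a PDFBox class. */
    static final class FontDict {
        final String name;
        final String subtype;
        final String encoding;
        final int firstChar;
        final int lastChar;
        final float[] widths; // one entry per code in firstChar..lastChar

        FontDict(String name, String subtype, String encoding,
                 int firstChar, int lastChar, float[] widths) {
            this.name = name;
            this.subtype = subtype;
            this.encoding = encoding;
            this.firstChar = firstChar;
            this.lastChar = lastChar;
            this.widths = widths;
        }

        /** Codes outside the declared range are reported with width zero. */
        float widthOf(int code) {
            if (code < firstChar || code > lastChar) {
                return 0f;
            }
            return widths[code - firstChar];
        }

        /** Assumed cache key: name, subtype and encoding only. */
        String weakKey() {
            return name + "/" + subtype + "/" + encoding;
        }
    }

    /** Looks the font up in the cache first, so a later, wider definition is ignored. */
    static float widthWithCache(Map<String, FontDict> cache, FontDict font, int code) {
        FontDict cached = cache.computeIfAbsent(font.weakKey(), k -> font);
        return cached.widthOf(code);
    }

    /** Always uses the font definition that is currently in effect. */
    static float widthWithoutCache(FontDict font, int code) {
        return font.widthOf(code);
    }

    public static void main(String[] args) {
        // First definition covers only the space character (code 32).
        FontDict spaceOnly = new FontDict("F1", "TrueType", "WinAnsiEncoding",
                32, 32, new float[] {250f});
        // Later definition with the same name/subtype/encoding widens the range.
        FontDict expanded = new FontDict("F1", "TrueType", "WinAnsiEncoding",
                32, 34, new float[] {250f, 333f, 408f});

        Map<String, FontDict> cache = new HashMap<>();
        widthWithCache(cache, spaceOnly, 32);                    // fills the cache
        System.out.println(widthWithCache(cache, expanded, 33)); // prints 0.0
        System.out.println(widthWithoutCache(expanded, 33));     // prints 333.0
    }
}
```

With the cache, the lookup for code 33 is answered by the first, narrower 
definition and comes back as zero; without it, the current definition is used 
and the width is correct.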
                
> Fonts should not be cached by PDFStreamEngine
> ---------------------------------------------
>
>                 Key: PDFBOX-610
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-610
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>         Environment: Win or Linux
>            Reporter: Peter Costello
>            Assignee: Andreas Lehmkühler
>              Labels: PDFStreamEngine, fontwidth
>             Fix For: 1.7.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> org.apache.pdfbox.util.PDFStreamEngine
>    Fonts are cached using the variable 'private Map documentFontCache = new 
> HashMap();',
>    which is used in method 'processSubStream()' in the call 'sr.fonts = 
> resources.getFonts(documentFontCache);'.
> The problem is that a PDF document can define a font with a limited range of 
> 'firstChar' and 'lastChar' (maybe just a space char), and then expand that 
> range at a later point within the same page. When the font is cached, those 
> updates are ignored. 
> In particular, test page 1 of 
> 'http://www.encana.com/investor/financial/shareholder/pdfs/info-circular-french.pdf'.
> With font caching, the widths of the characters in the upper right corner of 
> the page are reported as zero, and text extraction and text merging are 
> compromised.
> Without font caching, the widths are correct. There are other examples that 
> cause the same problem.
> To fix the problem change the call in method 'processSubStream()' to:
>              sr.fonts = resources.getFonts(null);
> Some effort was put into font caching. Unfortunately, it should not be 
> used on unknown documents.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
