[jira] [Created] (PDFBOX-3464) character height 3 times higher than expected

Roman (JIRA) Tue, 16 Aug 2016 02:18:08 -0700

Roman created PDFBOX-3464:
-----------------------------

             Summary: character height 3 times higher than expected
                 Key: PDFBOX-3464
                 URL: https://issues.apache.org/jira/browse/PDFBOX-3464
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
            Reporter: Roman
            Priority: Critical



This issue is essentially the same as PDFBOX-2749, but the wrong sample was 
attached to that issue by mistake. The correct PDF is attached here.

The core of the problem is that the font height for this specific font is 
determined incorrectly. Please see the code and comments below.

{code}
public class Extractor extends PDFTextStripper {
    //<...CUT...>
    protected void writePage() throws IOException {
        // charactersByArticle is inherited from the base class
        for (List<TextPosition> textList : charactersByArticle) {
            Iterator<TextPosition> textIter = textList.iterator();
            //<...CUT...>
            while (textIter.hasNext()) {
                TextPosition position = textIter.next();
                //<...CUT...>
                PDFontDescriptor fontDescriptor = position.getFont().getFontDescriptor();
                //<...CUT...>

                // Descriptor metrics are in 1/1000 em; scale them by the text matrix
                float yscale = position.getTextPos().getYScale();
                float asc = Math.abs(fontDescriptor.getAscent() / 1000 * yscale);
                float rh = Math.abs(fontDescriptor.getFontBoundingBox().getUpperRightY() / 1000 * yscale);

                float desc = Math.abs(fontDescriptor.getDescent() / 1000 * yscale);
                float capHeight = Math.abs(fontDescriptor.getCapHeight() / 1000 * yscale);
                if (capHeight == 0)
                    capHeight = position.getHeight();

                float h = (rh + Math.max(Math.max(capHeight, position.getHeight()), asc)) / 2;

                // "h" evaluates to 37.39 (should be between 11 and 12)
                // "desc" evaluates to 2.664
                // "capHeight" evaluates to 37.39
                // "position.getHeight()" evaluates to 33.48
            }
        }
    }
}
{code}
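
To make the arithmetic easier to check in isolation, the height formula from the snippet can be pulled out into a pure function with no PDFBox dependency. This is only a sketch: the class name HeightEstimate, the method names, and all input numbers are hypothetical and are not taken from the attached PDF. It shows how a single oversized descriptor entry (here the cap height) can dominate the result even when the other metrics look reasonable.

```java
public class HeightEstimate {

    /** Converts a font descriptor metric (1/1000 em) to scaled units. */
    public static float scale(float glyphUnits, float yScale) {
        return Math.abs(glyphUnits / 1000f * yScale);
    }

    /**
     * Same formula as in the report: the average of the scaled bounding-box
     * top and the largest of cap height, reported glyph height, and ascent.
     */
    public static float estimate(float ascent, float bboxUpperRightY,
                                 float capHeight, float glyphHeight,
                                 float yScale) {
        float asc = scale(ascent, yScale);
        float rh = scale(bboxUpperRightY, yScale);
        float cap = scale(capHeight, yScale);
        if (cap == 0) {
            // Fall back to the glyph height when the descriptor has no cap height
            cap = glyphHeight;
        }
        return (rh + Math.max(Math.max(cap, glyphHeight), asc)) / 2;
    }

    public static void main(String[] args) {
        // Hypothetical, plausible descriptor values at a y scale of 12
        System.out.println(estimate(750f, 900f, 700f, 8.0f, 12f));
        // Same font, but with a hypothetical inflated cap height entry
        System.out.println(estimate(750f, 900f, 3000f, 8.0f, 12f));
    }
}
```

With these made-up inputs the second call returns a value more than twice the first, although only the cap height changed. Comparing the individual scaled terms in this way may help narrow down which descriptor entry is responsible for the inflated "h" in the attached file.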



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
