[jira] [Commented] (PDFBOX-3464) character height 3 times higher than expected

Tilman Hausherr (JIRA) Wed, 24 Aug 2016 09:33:42 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435221#comment-15435221
 ]


Tilman Hausherr commented on PDFBOX-3464:
-----------------------------------------

That is because the existing code divides the BBox height by 2. I believe the 
authors intended this to approximate the height of a/e/i/o/u. This value is 
then used to decide whether a glyph belongs to an existing line or not.
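
To make the idea concrete, here is a minimal sketch of that kind of heuristic. 
It is not the PDFBox code: the class and method names are made up for 
illustration, y is assumed to grow downward with each glyph occupying 
[y - height, y], and the overlap test is just one plausible way to decide line 
membership.

```java
// Hypothetical sketch, not the PDFBox implementation.
class LineHeightSketch {

    // Approximate the height of a lowercase vowel as half the font bounding-box
    // height. The bbox height is given in glyph space (1/1000 of a text-space
    // unit) and is scaled by the font size.
    static float approxGlyphHeight(float bboxHeightGlyphUnits, float fontSize) {
        return bboxHeightGlyphUnits / 2f / 1000f * fontSize;
    }

    // Assumed rule: a glyph belongs to the line if its vertical range
    // [glyphY - glyphHeight, glyphY] overlaps the line's range
    // [lineY - lineHeight, lineY]. Both y values are baselines, growing downward.
    static boolean sameLine(float lineY, float lineHeight,
                            float glyphY, float glyphHeight) {
        return (glyphY <= lineY && glyphY >= lineY - lineHeight)
            || (lineY <= glyphY && lineY >= glyphY - glyphHeight);
    }

    public static void main(String[] args) {
        float h = approxGlyphHeight(1000f, 12f);
        System.out.println(h);                          // 6.0
        System.out.println(sameLine(100f, h, 103f, h)); // true: ranges overlap
        System.out.println(sameLine(100f, h, 120f, h)); // false: 20 units apart
    }
}
```

With a height three times too large, the second call would also report an 
overlap, which is how an inflated height can merge separate lines.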

But don't believe me... depending on what you do, a different strategy than 
ours might be better. What I'd recommend is that you build a good test set so 
that you can see what happens if the strategy changes.

Re 1000: "The glyph coordinate system is the space in which an individual 
character’s glyph is defined. All path coordinates and metrics shall be 
interpreted in glyph space. For all font types except Type 3, the units of 
glyph space are one-thousandth of a unit of text space; for a Type 3 font, the 
transformation from glyph space to text space shall be defined by a font matrix 
specified in an explicit FontMatrix entry in the font."

So 1000 is correct. (If it weren't, PDFBox would not display your file 
correctly either, but it displays fine.)
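
As a rough illustration of the quoted rule, the conversion could look like the 
sketch below. The class and method names are hypothetical, and the Type 3 case 
is simplified to a single vertical scale factor (element d of a FontMatrix 
[a b c d e f]), ignoring any rotation or shear.

```java
// Hypothetical sketch of glyph-space to text-space conversion.
class GlyphSpaceSketch {

    // All font types except Type 3: one glyph-space unit is 1/1000 of a
    // text-space unit.
    static float glyphToTextSpace(float glyphUnits) {
        return glyphUnits / 1000f;
    }

    // Type 3 fonts: the scale comes from the font's FontMatrix instead.
    // Simplifying assumption: only the vertical scale factor d is applied.
    static float type3GlyphToTextSpace(float glyphUnits, float fontMatrixD) {
        return glyphUnits * fontMatrixD;
    }

    public static void main(String[] args) {
        float yscale = 12f; // e.g. a 12 pt font with no extra scaling
        // An ascent of 750 glyph units becomes 0.75 text-space units.
        System.out.println(glyphToTextSpace(750f) * yscale); // 9.0
        // A Type 3 font with d = 0.001 gives approximately the same result.
        System.out.println(type3GlyphToTextSpace(750f, 0.001f) * yscale);
    }
}
```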

> character height 3 times higher than expected
> ---------------------------------------------
>
>                 Key: PDFBOX-3464
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3464
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Roman
>            Priority: Minor
>         Attachments: notHelped.png, nowItsHelped.png, screenshot-1.png, 
> screenshot.png, subnode.docx.pdf
>
>
> The issue is basically the same as PDFBOX-2749, but the wrong sample was 
> attached to it by mistake. The correct PDF is attached here.
> The core of the problem is that the font height for this specific font is 
> determined incorrectly; please see the code with comments below.
> The issue was reproduced on PDFBox 1.8.4, but as we tested before, we get the 
> same result on versions 1.8.9 and 2.0.
> {code}
> public class Extractor extends PDFTextStripper {
> //<...CUT...>
>     protected void writePage() throws IOException {
>         // charactersByArticle was inherited from base class
>         for (List<TextPosition> textList : charactersByArticle) {
>             Iterator textIter = textList.iterator();
> //<...CUT...>
>             while (textIter.hasNext()) {
>                 TextPosition position = (TextPosition) textIter.next();
> //<...CUT...>
>                 PDFontDescriptor fontDescriptor = position.getFont().getFontDescriptor();
> //<...CUT...>
>                 float yscale = position.getTextPos().getYScale();
>                 float asc = Math.abs(fontDescriptor.getAscent() / 1000 * yscale);
>                 float rh = Math.abs(fontDescriptor.getFontBoundingBox().getUpperRightY() / 1000 * yscale);
>                 float desc = Math.abs(fontDescriptor.getDescent() / 1000 * yscale);
>                 float capHeight = Math.abs(fontDescriptor.getCapHeight() / 1000 * yscale);
>                 if (capHeight == 0)
>                     capHeight = position.getHeight();
>                 float h = (rh + Math.max(Math.max(capHeight, position.getHeight()), asc)) / 2;
> //"h" evaluates to 37.39 (should be between 11 and 12)
> //"desc" evaluates to 2.664
> //"capHeight" evaluates to 37.39
> //"position.getHeight()" evaluates to 33.48
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

