[ https://issues.apache.org/jira/browse/PDFBOX-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Roman updated PDFBOX-3464:
--------------------------
Attachment: screenshot-1.png
> character height 3 times higher than expected
> ---------------------------------------------
>
> Key: PDFBOX-3464
> URL: https://issues.apache.org/jira/browse/PDFBOX-3464
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Reporter: Roman
> Priority: Critical
> Attachments: screenshot-1.png, subnode.docx.pdf
>
>
> This issue is basically the same as PDFBOX-2749, but the wrong sample was
> attached to that issue by mistake. The correct PDF is attached here.
> The core of the problem is that the font height for this specific font is
> determined incorrectly; please see the code with comments below.
> {code}
> public class Extractor extends PDFTextStripper {
>     //<...CUT...>
>     protected void writePage() throws IOException {
>         for (List<TextPosition> textList : charactersByArticle) {
>             //charactersByArticle was inherited from base class
>             Iterator textIter = textList.iterator();
>             //<...CUT...>
>             while (textIter.hasNext()) {
>                 TextPosition position = (TextPosition) textIter.next();
>                 //<...CUT...>
>                 PDFontDescriptor fontDescriptor = position.getFont().getFontDescriptor();
>                 //<...CUT...>
>                 float yscale = position.getTextPos().getYScale();
>                 float asc = Math.abs(fontDescriptor.getAscent() / 1000 * yscale);
>                 float rh = Math.abs(fontDescriptor.getFontBoundingBox().getUpperRightY() / 1000 * yscale);
>                 float desc = Math.abs(fontDescriptor.getDescent() / 1000 * yscale);
>                 float capHeight = Math.abs(fontDescriptor.getCapHeight() / 1000 * yscale);
>                 if (capHeight == 0)
>                     capHeight = position.getHeight();
>                 float h = (rh + Math.max(Math.max(capHeight, position.getHeight()), asc)) / 2;
>                 //"h" evaluates to 37.39 (should be between 11 and 12)
>                 //"desc" evaluates to 2.664
>                 //"capHeight" evaluates to 37.39
>                 //"position.getHeight()" evaluates to 33.48
> {code}
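
The arithmetic in the snippet above can be checked without PDFBox. The following is only a minimal sketch: the class HeightEstimate and its methods are hypothetical names introduced for illustration, and the formula is copied from the reported code. Since "rh" and "asc" are absolute values, the smallest value either can take is 0, so plugging in 0 for both gives a lower bound on "h" for the reported capHeight (37.39) and position.getHeight() (33.48). The actual "rh" and "asc" for this font are not stated in the report, so the zeros below are assumptions chosen only to show the bound.

```java
// Hypothetical helper, not part of PDFBox: reproduces the averaging formula
// from the reported snippet using plain floats.
class HeightEstimate {

    // Converts a font-descriptor metric in glyph-space units (1/1000 em)
    // to scaled units, as the snippet does with "/ 1000 * yscale".
    public static float scaled(float glyphUnits, float yScale) {
        return Math.abs(glyphUnits / 1000f * yScale);
    }

    // Same formula as in the report: average of rh and the largest of
    // capHeight, the position height and the ascent.
    public static float estimateHeight(float rh, float capHeight,
                                       float positionHeight, float asc) {
        if (capHeight == 0) {
            capHeight = positionHeight; // fallback used in the report
        }
        return (rh + Math.max(Math.max(capHeight, positionHeight), asc)) / 2;
    }

    public static void main(String[] args) {
        float capHeight = 37.39f;      // value reported in the issue
        float positionHeight = 33.48f; // value reported in the issue

        // ASSUMPTION: rh = 0 and asc = 0 are the smallest possible inputs,
        // not the real values for this font.
        float lowerBound = estimateHeight(0f, capHeight, positionHeight, 0f);

        System.out.println("lower bound for h = " + lowerBound);
        System.out.println("above expected range (11..12): " + (lowerBound > 12f));
    }
}
```

Even with "rh" and "asc" at zero, "h" cannot fall below half of capHeight, which is already above the expected 11 to 12. This suggests that the inflated capHeight and position.getHeight() values by themselves are enough to push "h" out of range, whatever the bounding box reports.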