[
https://issues.apache.org/jira/browse/PDFBOX-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075885#comment-15075885
]
Leo commented on PDFBOX-3175:
-----------------------------
The values are probably correct for PDFTextStripper class usage. I thought that
clustering TextPosition objects into lines is not a concern of
PDFTextStreamEngine, but a concern of the PDFTextStripper class. So if a
particular extension of the PDFTextStreamEngine class needs the coordinates to
be divided by 2, it does not mean that every other possible extension will also
need them to be divided. I
think that particular line of code belongs to the part where line calculations
are done, not where the character stream is produced. Maybe it should be just
moved to PDFTextStripper?
All that said, of course, assumes that I correctly understand the point of
designing the two as separate classes.
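To illustrate the separation argued for above, here is a minimal sketch. The class names GlyphEngine and LineGroupingStripper are hypothetical stand-ins, not PDFBox classes, and the halving is only the adjustment mentioned in this comment; the actual PDFBox code may be organized differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the class that produces the character stream.
// It reports glyph heights as measured, with no layout-specific adjustment.
class GlyphEngine {
    // Hook that subclasses may override to adjust the reported height.
    protected float reportedHeight(float measuredHeight) {
        return measuredHeight;
    }

    // Turns measured heights into the heights handed to consumers.
    final List<Float> process(float[] measuredHeights) {
        List<Float> out = new ArrayList<>();
        for (float h : measuredHeights) {
            out.add(reportedHeight(h));
        }
        return out;
    }
}

// Hypothetical stand-in for the class that clusters glyphs into lines.
// Only this subclass applies the division by 2 that it needs.
class LineGroupingStripper extends GlyphEngine {
    @Override
    protected float reportedHeight(float measuredHeight) {
        return measuredHeight / 2;
    }
}
```

With this layout, any other extension of the engine sees unmodified heights, and only the line-grouping subclass pays for its own heuristic.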
> PDFTextStreamEngine probably miscalculates text height
> ------------------------------------------------------
>
> Key: PDFBOX-3175
> URL: https://issues.apache.org/jira/browse/PDFBOX-3175
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Leo
> Attachments: MarketT_140815-1-marked-1-18.png,
> MarketT_140815-1-marked-1.png, PDFBOX-3175-reduced.pdf, snapshot.png
>
>
> When parsing a PDF document, TextPosition is created with a constant text
> height, about 2 times smaller than the character width, regardless of font size.
> The following workaround to calculate dyDisplay fixes the issue:
> float verticalScaling = 1/1000f;
> if (font instanceof PDType3Font) {
>     Matrix fontMatrix = font.getFontMatrix();
>     verticalScaling = fontMatrix.getValue(1, 1);
> }
> float dyDisplay = bbox.getHeight() * fontSize * verticalScaling;
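
As a rough illustration of the arithmetic in the workaround quoted above, the sketch below isolates the formula from PDFBox types. The class DisplayHeight and its method are hypothetical, and the bounding-box height of 700 glyph units is an assumed example value, not taken from the attached PDF.

```java
// Hypothetical helper that mirrors the dyDisplay formula from the workaround.
final class DisplayHeight {
    // Default scaling assumed for fonts whose glyph space is 1000 units per em.
    static final float DEFAULT_VERTICAL_SCALING = 1 / 1000f;

    private DisplayHeight() {
    }

    // bboxHeight: font bounding-box height in glyph units.
    // fontSize: font size in text-space units.
    // verticalScaling: glyph-space to text-space factor (for a Type 3 font,
    // the workaround reads it from the font matrix instead of using the default).
    static float compute(float bboxHeight, float fontSize, float verticalScaling) {
        return bboxHeight * fontSize * verticalScaling;
    }
}
```

Under these assumptions, a bounding box 700 units high gives a height of about 8.4 at font size 12 and about 16.8 at font size 24, so the result scales with the font size instead of staying constant.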