[
https://issues.apache.org/jira/browse/PDFBOX-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075881#comment-15075881
]
Tilman Hausherr edited comment on PDFBOX-3175 at 12/31/15 10:22 AM:
--------------------------------------------------------------------
I get this result with both PrintTextLocations examples on the
PDFBOX-3175-reduced.pdf file:
{code}
String[400.0,200.0 fs=20.0 xscale=20.0 height=11.2 space=5.5200005 width=16.200012]M
String[416.2,200.0 fs=20.0 xscale=20.0 height=11.2 space=5.5200005 width=5.1600037]I
String[421.36002,200.0 fs=20.0 xscale=20.0 height=11.2 space=5.5200005 width=14.480011]C
String[435.84003,200.0 fs=20.0 xscale=20.0 height=11.2 space=5.5200005 width=13.440002]E
String[449.28003,200.0 fs=20.0 xscale=20.0 height=11.2 space=5.5200005 width=12.76001]X
{code}
You wrote that
{quote}
Removing the division by 2 makes call to TextPosition almost identical to 1.8
style behavior
{quote}
What value is different for you? Note that this
{code}
// 1/2 the bbox is used as the height todo: why?
float glyphHeight = bbox.getHeight() / 2;
{code}
is the intended behavior at this time. Yes, it looks weird, but it is a good
value for identifying lines that belong together. If you don't like it, use
the approach taken for the blue marks in DrawPrintTextLocations, which uses
only the bounding box, with no heuristics and no adjustment of "wild" values.
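To make the factor of 2 concrete, here is a minimal standalone sketch of the arithmetic. It is not PDFBox code: the class and method names are hypothetical, and it assumes a non-Type 3 font with 1000 glyph-space units per text-space unit and no extra vertical scaling. Under those assumptions, the height=11.2 at fs=20.0 in the output above would correspond to a bounding box height of 1120 glyph units, and using the full bounding box would give twice that.

```java
/**
 * Illustrative sketch only; not taken from PDFBox.
 * Compares the "half bbox" heuristic with the full bounding box height.
 */
final class GlyphHeightSketch {
    // Assumption: 1000 glyph-space units per text-space unit (non-Type 3 fonts).
    static final float GLYPH_UNITS = 1000f;

    /** Height as reported with the heuristic: half the bbox, scaled by font size. */
    static float heuristicHeight(float bboxHeight, float fontSize) {
        float glyphHeight = bboxHeight / 2; // the division discussed in this issue
        return glyphHeight / GLYPH_UNITS * fontSize;
    }

    /** Height without the heuristic: the full bbox, scaled by font size. */
    static float fullBBoxHeight(float bboxHeight, float fontSize) {
        return bboxHeight / GLYPH_UNITS * fontSize;
    }

    public static void main(String[] args) {
        float bboxHeight = 1120f; // hypothetical value, back-computed from height=11.2
        float fontSize = 20f;     // fs=20.0 in the output above
        System.out.println("heuristic height = " + heuristicHeight(bboxHeight, fontSize));
        System.out.println("full bbox height = " + fullBBoxHeight(bboxHeight, fontSize));
    }
}
```

The second method matches the shape of the workaround in the issue description (bbox height times font size times a vertical scaling of 1/1000), so the two values should differ by exactly the factor of 2.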
> PDFTextStreamEngine probably miscalculates text height
> ------------------------------------------------------
>
> Key: PDFBOX-3175
> URL: https://issues.apache.org/jira/browse/PDFBOX-3175
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Leo
> Attachments: MarketT_140815-1-marked-1-18.png,
> MarketT_140815-1-marked-1.png, PDFBOX-3175-reduced.pdf, snapshot.png
>
>
> When parsing a PDF document, TextPosition is created with a constant text
> height, about 2 times smaller than the character width, regardless of font
> size. The following workaround to calculate dyDisplay fixes the issue:
> float verticalScaling = 1/1000f;
> if (font instanceof PDType3Font) {
>     Matrix fontMatrix = font.getFontMatrix();
>     verticalScaling = fontMatrix.getValue(1, 1);
> }
> float dyDisplay = bbox.getHeight() * fontSize * verticalScaling;