[
https://issues.apache.org/jira/browse/PDFBOX-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15027434#comment-15027434
]
Tilman Hausherr commented on PDFBOX-3062:
-----------------------------------------
I tried using CapHeight whenever it is available, but the results were very bad.
Using CapHeight as a fallback worked nicely on my test set; the only
differences were acceptable ones.
{code}
float glyphHeight = bbox.getHeight() / 2;
// sometimes the bbox has very high values, but CapHeight is OK
PDFontDescriptor fontDescriptor = font.getFontDescriptor();
if (fontDescriptor != null)
{
    float capHeight = fontDescriptor.getCapHeight();
    if (capHeight != 0 && capHeight < glyphHeight)
    {
        glyphHeight = capHeight;
    }
}
{code}
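To make the rule easier to try without a PDF at hand, here is a minimal standalone sketch of the same decision. The class GlyphHeightFallback and its method are hypothetical names, not part of PDFBox, and the sketch assumes that a missing CapHeight is reported as 0, which is what the != 0 check in the snippet suggests.

```java
// Hypothetical helper, not part of PDFBox: it isolates the fallback rule
// from the snippet so it can be exercised with plain numbers.
final class GlyphHeightFallback
{
    private GlyphHeightFallback()
    {
    }

    /**
     * @param bboxHeight height of the font bounding box, in glyph space units
     * @param capHeight CapHeight from the font descriptor, assumed to be 0 when absent
     * @return the estimated glyph height, in the same units
     */
    static float glyphHeight(float bboxHeight, float capHeight)
    {
        // default estimate: half of the font bounding box height
        float glyphHeight = bboxHeight / 2;
        // use CapHeight only when it is set and smaller than the default,
        // i.e. when the bounding box looks implausibly high
        if (capHeight != 0 && capHeight < glyphHeight)
        {
            glyphHeight = capHeight;
        }
        return glyphHeight;
    }

    public static void main(String[] args)
    {
        // plausible bbox: CapHeight is not smaller, so half the bbox is kept
        System.out.println(glyphHeight(1000f, 700f));
        // very high bbox: CapHeight takes over
        System.out.println(glyphHeight(4000f, 700f));
        // no CapHeight: half the bbox is kept
        System.out.println(glyphHeight(4000f, 0f));
    }
}
```

With this rule CapHeight can only lower the estimate, never raise it, so fonts with a sane bounding box should behave as before.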
I'll test this change with a larger test set, but not today.
> Text extraction and height different in 2.0
> -------------------------------------------
>
> Key: PDFBOX-3062
> URL: https://issues.apache.org/jira/browse/PDFBOX-3062
> Project: PDFBox
> Issue Type: Sub-task
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Tilman Hausherr
> Attachments: 005021-reduced.pdf,
> PDFBOX-3062-H6NIYQXHLPGD3GI6SNIYINRAZBCDHUCB-reduced-marked-1.png,
> PDFBOX-3062-H6NIYQXHLPGD3GI6SNIYINRAZBCDHUCB-reduced.pdf,
> PDFBOX-3062-H6NIYQXHLPGD3GI6SNIYINRAZBCDHUCB.pdf,
> PDFBOX-3062-N2MOQ7YZICIYGTPLQJAWJ4HLN6CCEMHZ-reduced.pdf, garbled text 2.pdf
>
>
> AR:
> {code}
> WITH THE increasing complexity of optical modules,
> {code}
> 1.8:
> {code}
> WITH THE increasing complexity of optical modules,
> String[39.6,399.6 fs=1.0 xscale=29.888 height=20.114626 space=7.472 width=28.214272]W
> String[69.488,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=3.3176804]I
> String[72.80568,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=6.0873947]T
> String[78.893074,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=7.1932907]H
> String[90.71916,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=6.0873947]T
> String[96.80656,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=7.1932907]H
> {code}
> 2.0:
> {code}
> W
> ITH THE increasing complexity of optical modules,
> String[39.6,399.6 fs=1.0 xscale=29.888 height=9.584274 space=7.472 width=28.209717]W
> String[69.488,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=3.3177567]I
> String[72.805756,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=6.0858]T
> String[78.891556,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=7.1949615]H
> String[90.719315,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=6.0858]T
> String[96.805115,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=7.1949615]H
> {code}