[
https://issues.apache.org/jira/browse/PDFBOX-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14996179#comment-14996179
]
Tilman Hausherr commented on PDFBOX-3062:
-----------------------------------------
But the BBox height is huge... the result is that text that is on different
lines is extracted as being on the same line.
Yes we could of course get a real BBox by getting through the glyphs like I
recently did for type 3 fonts. But that would make text extraction slower.
At this time I'm not saying that anything should be done. This might just be a
single file that won't extract nicely. (Although the extraction will still be
OK for a search engine)
At the same time, I've noticed yesterday that many files I had marked for
further review are now good.
> Text extraction and height different in 2.0
> -------------------------------------------
>
> Key: PDFBOX-3062
> URL: https://issues.apache.org/jira/browse/PDFBOX-3062
> Project: PDFBox
> Issue Type: Sub-task
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Tilman Hausherr
> Attachments: 005021-reduced.pdf,
> PDFBOX-3062-H6NIYQXHLPGD3GI6SNIYINRAZBCDHUCB-reduced-marked-1.png,
> PDFBOX-3062-H6NIYQXHLPGD3GI6SNIYINRAZBCDHUCB-reduced.pdf,
> PDFBOX-3062-H6NIYQXHLPGD3GI6SNIYINRAZBCDHUCB.pdf,
> PDFBOX-3062-N2MOQ7YZICIYGTPLQJAWJ4HLN6CCEMHZ-reduced.pdf
>
>
> AR:
> {code}
> WITH THE increasing complexity of optical modules,
> {code}
> 1.8:
> {code}
> WITH THE increasing complexity of optical modules,
> String[39.6,399.6 fs=1.0 xscale=29.888 height=20.114626 space=7.472
> width=28.214272]W
> String[69.488,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075
> width=3.3176804]I
> String[72.80568,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075
> width=6.0873947]T
> String[78.893074,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075
> width=7.1932907]H
> String[90.71916,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075
> width=6.0873947]T
> String[96.80656,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075
> width=7.1932907]H
> {code}
> 2.0:
> {code}
> W
> ITH THE increasing complexity of optical modules,
> String[39.6,399.6 fs=1.0 xscale=29.888 height=9.584274 space=7.472
> width=28.209717]W
> String[69.488,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075
> width=3.3177567]I
> String[72.805756,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075
> width=6.0858]T
> String[78.891556,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075
> width=7.1949615]H
> String[90.719315,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075
> width=6.0858]T
> String[96.805115,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075
> width=7.1949615]H
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]