[
https://issues.apache.org/jira/browse/PDFBOX-5002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221702#comment-17221702
]
Tilman Hausherr edited comment on PDFBOX-5002 at 10/27/20, 7:44 PM:
--------------------------------------------------------------------
Seems nice. I need to review the results (differences) of the tests with my
own, bigger test set.
The different extraction in the "EU" file could be problematic (although the
result looks better). This is a test file of the Tabula project (there are
many, but I kept that one as an early indicator of trouble). They don't want any
extraction differences.
The good thing is that the {{testTabula()}} test passes (it uses a different
algorithm to get font heights). But I'd need to test the Tabula build too, which
has more tests.
> PDFTextStripper sometimes fuses two words on different lines
> ------------------------------------------------------------
>
> Key: PDFBOX-5002
> URL: https://issues.apache.org/jira/browse/PDFBOX-5002
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 2.0.21
> Reporter: Thierry Guérin
> Priority: Minor
> Fix For: 2.0.22
>
> Attachments: small&Big.pdf
>
>
> This happens when a text in a big font is followed by at least two lines of
> text in a smaller font: the last word of the first line is merged with the
> first word of the second line.
> On the attached PDF, the extracted text is:
> {noformat}
> (...) some text awith smaller font (...)
> {noformat}
> instead of:
>
> {noformat}
> (...) some text with a smaller font (...)
> {noformat}
> I often encounter this kind of problem on invoices, where the company address
> (small text at the top right) is next to the company name & logo (big
> centered text at the top).
>
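For readers unfamiliar with how this kind of fusing can arise, here is a toy model in Java. It is only a sketch under stated assumptions: it is not PDFBox's code and not the change made for PDFBOX-5002. The class LineGroupingSketch, the Word type, the keepTallest flag and all coordinates are hypothetical. The assumption is that a word stays on the current line when its vertical span overlaps a reference span, and that a space is written only when there is a horizontal gap. If the reference span is inherited from a big-font word, a second small-font line can still overlap it, and because it starts further left than the previous word ended, no space is written.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: hypothetical names and numbers, not PDFBox internals.
class LineGroupingSketch {

    /** One word: left edge, width, baseline (y grows downward) and glyph height. */
    static final class Word {
        final String text;
        final float x, width, baseline, height;

        Word(String text, float x, float width, float baseline, float height) {
            this.text = text;
            this.x = x;
            this.width = width;
            this.baseline = baseline;
            this.height = height;
        }
    }

    /** True if the vertical spans [b1 - h1, b1] and [b2 - h2, b2] intersect. */
    static boolean overlaps(float b1, float h1, float b2, float h2) {
        return b1 - h1 < b2 && b2 - h2 < b1;
    }

    /**
     * Joins words in stream order. A word that overlaps the reference span stays
     * on the current line, otherwise a line break is written.
     * keepTallest = true: the tallest word seen on the line stays the reference.
     * keepTallest = false: the previous word is always the reference.
     */
    static String extract(List<Word> words, boolean keepTallest) {
        StringBuilder out = new StringBuilder();
        float refBaseline = 0, refHeight = 0, prevEnd = 0;
        boolean first = true;
        for (Word w : words) {
            if (first) {
                refBaseline = w.baseline;
                refHeight = w.height;
                first = false;
            } else if (!overlaps(refBaseline, refHeight, w.baseline, w.height)) {
                out.append('\n');                 // new line, new reference span
                refBaseline = w.baseline;
                refHeight = w.height;
            } else {
                if (w.x > prevEnd) {
                    out.append(' ');              // space only on a horizontal gap
                }
                if (!keepTallest || w.height > refHeight) {
                    refBaseline = w.baseline;
                    refHeight = w.height;
                }
            }
            out.append(w.text);
            prevEnd = w.x + w.width;
        }
        return out.toString();
    }

    /** Big word on the left, two small-font lines to its right. */
    static List<Word> sample() {
        return Arrays.asList(
                new Word("NAME", 0, 100, 100, 30),    // big font, span 70..100
                new Word("text", 120, 20, 85, 10),    // small line 1, span 75..85
                new Word("a", 145, 5, 85, 10),
                new Word("with", 120, 20, 97, 10),    // small line 2, span 87..97
                new Word("smaller", 145, 35, 97, 10));
    }

    public static void main(String[] args) {
        System.out.println(extract(sample(), true));
        System.out.println("----");
        System.out.println(extract(sample(), false));
    }
}
```

With the sample data, keepTallest = true yields "NAME text awith smaller" on one line, which has the same shape as the reported output. With keepTallest = false, the second small-font line no longer overlaps the reference span, so a line break separates "a" and "with".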