[jira] [Comment Edited] (PDFBOX-5002) PDFTextStripper sometimes fuses two words on different lines

Tilman Hausherr (Jira) Tue, 27 Oct 2020 12:45:26 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221702#comment-17221702
 ]


Tilman Hausherr edited comment on PDFBOX-5002 at 10/27/20, 7:44 PM:
--------------------------------------------------------------------

Seems nice. I need to review the results (differences) of tests with my own, 
bigger test set.

The different extraction in the "EU" file could be problematic (although the 
result looks better). This is a test file of the Tabula project (there are 
many, but I kept that one as an early indicator of trouble). They don't want any 
extraction differences. 

The good thing is that the {{testTabula()}} test passes (it uses a different 
algorithm to get font heights). But I'd need to test the Tabula build too, which 
has more tests.


> PDFTextStripper sometimes fuses two words on different lines
> ------------------------------------------------------------
>
>                 Key: PDFBOX-5002
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5002
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.21
>            Reporter: Thierry Guérin
>            Priority: Minor
>             Fix For: 2.0.22
>
>         Attachments: small&Big.pdf
>
>
> This happens when a text in a big font is followed by at least two lines of 
> text in a smaller font: the last word of the first line is merged with the 
> first word of the second line.
> On the attached PDF, the extracted text is:
> {noformat}
> (...) some text awith smaller font (...)
> {noformat}
> instead of:
>  
> {noformat}
> (...) some text with a smaller font (...)
> {noformat}
> I often encounter this kind of problem on invoices, where the company address 
> (small text at the top right) is next to the company name & logo (big 
> centered text at the top).
>  


