[ 
https://issues.apache.org/jira/browse/PDFBOX-5002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224259#comment-17224259
 ] 

Tilman Hausherr commented on PDFBOX-5002:
-----------------------------------------

I looked at my own files, and these are the ones that have differences: 
[^PDFBOX-4550-pdnekz1gvl7.pdf]  [^001991.pdf]  [^artikel1_20_arab.pdf]  
[^PDFBOX-756-p1.pdf]  [^PDFBOX-3062-005021.pdf]  [^PDFBOX-3248-spaces.pdf]. 
IMHO we can live with this. The differences are either unimportant or they make 
things better.

However, it is still possible that people will complain after the release; in 
that case we'd have to revert and make this an option.

> PDFTextStripper sometimes fuses two words on different lines
> ------------------------------------------------------------
>
>                 Key: PDFBOX-5002
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5002
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.21
>            Reporter: Thierry Guérin
>            Priority: Minor
>             Fix For: 2.0.22, 3.0.0 PDFBox
>
>         Attachments: 001991.pdf, PDFBOX-3062-005021.pdf, 
> PDFBOX-3248-spaces.pdf, PDFBOX-4550-pdnekz1gvl7.pdf, PDFBOX-756-p1.pdf, 
> artikel1_20_arab.pdf, small&Big.pdf
>
>
> This happens when a text in a big font is followed by at least two lines of 
> text in a smaller font: the last word of the first line is merged with the 
> first word of the second line.
> In the attached PDF, the extracted text is:
> {noformat}
> (...) some text awith smaller font (...)
> {noformat}
> instead of:
> {noformat}
> (...) some text with a smaller font (...)
> {noformat}
> I often encounter this kind of problem on invoices, where the company address 
> (small text at the top right) is next to the company name & logo (big 
> centered text at the top).
>  
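
To make the reported failure mode concrete, here is a toy sketch of how a line break can be missed when a big-font word sits next to two small-font lines. It is not PDFBox's code and not the actual fix; the `Word` class, the `group` method and the sample coordinates are all hypothetical. It only assumes that words are assigned to lines by vertical overlap, and compares two ways of tracking a line's extent.

```java
import java.util.ArrayList;
import java.util.List;

public class LineFusionSketch {

    /** Hypothetical word box: y grows downward, the box spans [baseline - height, baseline]. */
    static final class Word {
        final String text;
        final float baseline;
        final float height;

        Word(String text, float baseline, float height) {
            this.text = text;
            this.baseline = baseline;
            this.height = height;
        }

        float top() {
            return baseline - height;
        }
    }

    /**
     * Groups words (already in reading order) into lines by vertical overlap.
     * If trackTallest is true, each word is compared with the full vertical
     * extent seen so far on the line; otherwise only with the previous word.
     */
    static List<String> group(List<Word> words, boolean trackTallest) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        float top = 0;
        float bottom = 0;
        for (Word w : words) {
            boolean sameLine = current.length() > 0
                    && w.top() <= bottom && top <= w.baseline();
            if (sameLine) {
                current.append(' ').append(w.text);
                if (trackTallest) {
                    // The line keeps the extent of its tallest word.
                    top = Math.min(top, w.top());
                    bottom = Math.max(bottom, w.baseline());
                } else {
                    // The line extent follows the most recent word only.
                    top = w.top();
                    bottom = w.baseline();
                }
            } else {
                if (current.length() > 0) {
                    lines.add(current.toString());
                }
                current = new StringBuilder(w.text);
                top = w.top();
                bottom = w.baseline();
            }
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }

    /** Made-up layout: a big company name followed by two small address lines. */
    static List<Word> sample() {
        List<Word> words = new ArrayList<>();
        words.add(new Word("ACME", 40f, 30f));   // big font, spans y 10..40
        words.add(new Word("12", 20f, 8f));      // small line 1, spans y 12..20
        words.add(new Word("Main", 20f, 8f));    // small line 1
        words.add(new Word("Street", 32f, 8f));  // small line 2, spans y 24..32
        return words;
    }

    public static void main(String[] args) {
        System.out.println(group(sample(), true));   // break between the small lines is missed
        System.out.println(group(sample(), false));  // break between the small lines is kept
    }
}
```

With the tallest word's extent kept for the whole line, both small lines overlap it and end up on one line; comparing only against the previous word keeps them apart. A real extractor has to handle many more cases (rotated text, superscripts, columns), which is why a change of this kind can alter the output for other files.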



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
