[ 
https://issues.apache.org/jira/browse/PDFBOX-5035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17245102#comment-17245102
 ] 

Tilman Hausherr commented on PDFBOX-5035:
-----------------------------------------

I modified the PrintTextLocations example to log the "string" parameter 
(there is no "text" parameter) in addition to the existing output, and got this:
{noformat}
string: '€  48,0000'
String[509.232,284.57202 fs=9.8 xscale=9.8 height=5.8310003 space=2.7244003 width=5.4487915]€
String[514.6808,284.57202 fs=9.8 xscale=9.8 height=5.8310003 space=2.7244003 width=2.7244263] 
String[517.4052,284.57202 fs=9.8 xscale=9.8 height=5.8310003 space=2.7244003 width=2.7244263] 
String[520.12964,284.57202 fs=9.8 xscale=9.8 height=5.8310003 space=2.7244003 width=5.4487915]4
String[525.5784,284.57202 fs=9.8 xscale=9.8 height=5.8310003 space=2.7244003 width=5.4487915]8
String[531.0272,284.57202 fs=9.8 xscale=9.8 height=5.8310003 space=2.7244003 width=2.7244263],
String[533.75165,284.57202 fs=9.8 xscale=9.8 height=5.8310003 space=2.7244003 width=5.4487915]0
String[539.20044,284.57202 fs=9.8 xscale=9.8 height=5.8310003 space=2.7244003 width=5.4487915]0
String[544.64923,284.57202 fs=9.8 xscale=9.8 height=5.8310003 space=2.7244003 width=5.4487915]0
String[550.098,284.57202 fs=9.8 xscale=9.8 height=5.8310003 space=2.7244003 width=5.4487915]0
{noformat}

(There is no guarantee that you'll get it this way, but this file has it. PDF 
creators are free to split the numbers, use different fonts, change the 
sequence, and so on.)
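
Since the order of the entries is not guaranteed, one way to double-check a 
log like the one above is to order the glyphs by their x position and look at 
the gaps between them. The following is only a rough sketch under some 
assumptions: GlyphLog and its methods are hypothetical helpers (not part of 
PDFBox), it uses only the JDK, and it expects each String[...] entry to sit on 
a single line in the format printed by PrintTextLocations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not a PDFBox class) for PrintTextLocations-style log entries. */
final class GlyphLog {

    /** One logged glyph: x position, advance width, and its text. */
    static final class Glyph {
        final float x;
        final float width;
        final String text;

        Glyph(float x, float width, String text) {
            this.x = x;
            this.width = width;
            this.text = text;
        }
    }

    // Matches "String[x,y ... width=w]text"; lines in any other shape are skipped.
    private static final Pattern ENTRY =
            Pattern.compile("^String\\[([0-9.]+),[0-9.]+ .*width=([0-9.]+)\\](.*)$");

    /** Parses log lines into glyphs. Lines are not trimmed, so a space glyph survives. */
    static List<Glyph> parse(List<String> lines) {
        List<Glyph> glyphs = new ArrayList<>();
        for (String line : lines) {
            Matcher m = ENTRY.matcher(line);
            if (m.matches()) {
                glyphs.add(new Glyph(Float.parseFloat(m.group(1)),
                        Float.parseFloat(m.group(2)), m.group(3)));
            }
        }
        return glyphs;
    }

    /** Returns a copy of the glyphs ordered left to right. */
    static List<Glyph> sortedByX(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        return sorted;
    }

    /** Joins the glyph texts in left-to-right order, regardless of log order. */
    static String textByPosition(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : sortedByX(glyphs)) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    /** Largest gap between the end of one glyph and the start of the next (0 if none). */
    static float maxGap(List<Glyph> glyphs) {
        List<Glyph> sorted = sortedByX(glyphs);
        float max = 0f;
        for (int i = 1; i < sorted.size(); i++) {
            Glyph prev = sorted.get(i - 1);
            max = Math.max(max, sorted.get(i).x - (prev.x + prev.width));
        }
        return max;
    }

    public static void main(String[] args) {
        // Three entries from the log above, deliberately out of order.
        List<String> log = Arrays.asList(
                "String[525.5784,284.57202 fs=9.8 xscale=9.8 height=5.8310003 space=2.7244003 width=5.4487915]8",
                "String[531.0272,284.57202 fs=9.8 xscale=9.8 height=5.8310003 space=2.7244003 width=2.7244263],",
                "String[520.12964,284.57202 fs=9.8 xscale=9.8 height=5.8310003 space=2.7244003 width=5.4487915]4");
        List<Glyph> glyphs = parse(log);
        System.out.println(textByPosition(glyphs));
        System.out.println("max gap: " + maxGap(glyphs));
    }
}
```

A gap close to zero between neighbours suggests that the glyphs form one 
contiguous run, so a character missing from the extracted text would not be 
explained by its position alone. This only looks at x and ignores y, fonts and 
rotation, so it is not a replacement for what PDFTextStripper does.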

> Missing character in text extraction
> ------------------------------------
>
>                 Key: PDFBOX-5035
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5035
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.21
>            Reporter: Marco Barbi
>            Priority: Major
>         Attachments: FT_FTDGT-03770_20.pdf, FT_FTDGT-03770_20.txt, 
> image-2020-12-07-09-47-40-046.png
>
>
> When applying the PDFTextStripper to the attached PDF, the highlighted text:
>  
> !image-2020-12-07-09-47-40-046.png|width=333,height=169!
>  
> is read as "8,0000" instead of "48,0000", so it seems the character "4" gets 
> lost.
>  
> Is this a bug or something related to the internal PDF structure?
>  
>  


