[
https://issues.apache.org/jira/browse/PDFBOX-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621240#comment-17621240
]
Michael Klink commented on PDFBOX-5529:
---------------------------------------
{quote}I looked for *ActualText* information, but I didn't find any tag like
this in the PDF content.{quote}
Then please share the PDF for further analysis.
While you're right that, in the case of your document, the text extraction
result would improve by _not_ trying to identify gaps, in general one needs
this gap detection.
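
To illustrate why gap detection is needed in general, here is a minimal sketch of such a heuristic. It is not the PDFBox implementation: the class {{GapDetectionSketch}}, the {{Glyph}} type, and the glyph positions and widths are all hypothetical, made up for illustration. The idea is that many PDFs draw separate words without any space glyph and only move the text position, so an extractor has to infer a space whenever the horizontal gap between two glyphs exceeds some fraction of the space width.

```java
import java.util.ArrayList;
import java.util.List;

public class GapDetectionSketch {

    /** Hypothetical stand-in for a positioned glyph: its text, start x, and advance width. */
    static final class Glyph {
        final String text;
        final double x;
        final double width;

        Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins glyphs left to right and inserts a space whenever the gap between the
     * end of the previous glyph and the start of the next one is larger than
     * spaceWidth * tolerance. Simplified assumption: one line, glyphs already sorted.
     */
    static String joinWithGaps(List<Glyph> glyphs, double spaceWidth, double tolerance) {
        StringBuilder sb = new StringBuilder();
        double previousEnd = Double.NaN;
        for (Glyph g : glyphs) {
            if (!Double.isNaN(previousEnd) && g.x - previousEnd > spaceWidth * tolerance) {
                sb.append(' ');
            }
            sb.append(g.text);
            previousEnd = g.x + g.width;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two words drawn without a space glyph; only the position jump separates them.
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("O", 0.0, 6.0));
        glyphs.add(new Glyph("u", 6.0, 5.0));
        glyphs.add(new Glyph("r", 11.0, 3.0));
        glyphs.add(new Glyph("r", 16.5, 3.0)); // starts 2.5 units after the previous glyph ends
        glyphs.add(new Glyph("e", 19.5, 5.0));
        glyphs.add(new Glyph("f", 24.5, 3.0));
        System.out.println(joinWithGaps(glyphs, 2.5, 0.5));
    }
}
```

The same heuristic misfires when a document uses unusually large kerning offsets inside words, which would be consistent with the behaviour you describe. In PDFBox 2.0.x the thresholds of the real heuristic can be tuned via {{PDFTextStripper.setSpacingTolerance}} and {{PDFTextStripper.setAverageCharTolerance}}; whether that helps for your document can only be checked against the actual file.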
> Wrong Text Extraction - Unwanted Extra Spaces in the middle of words
> --------------------------------------------------------------------
>
> Key: PDFBOX-5529
> URL: https://issues.apache.org/jira/browse/PDFBOX-5529
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.0.3, 2.0.4, 2.0.5, 2.0.6, 2.0.7,
> 2.0.8, 2.0.9, 2.0.10, 2.0.11, 2.0.12, 2.0.13, 2.0.14, 2.0.15, 2.0.16, 2.0.17,
> 2.0.18, 2.0.19, 2.0.20, 2.0.21, 2.0.22, 2.0.23, 2.0.24, 2.0.25, 2.0.26, 2.0.27
> Reporter: Carlos Alfonso Maya
> Priority: Major
> Attachments: image-2022-10-18-15-53-06-512.png,
> image-2022-10-18-16-23-00-123.png, image-2022-10-18-16-26-15-001.png,
> image-2022-10-19-16-48-36-198.png
>
>
> *Overview:*
> We use PDFBox as a third-party library to extract text from financial PDF
> documents.
> We have been using PDFBox for a long time, and we have detected a problem
> with incorrect text extraction on PDFs from a customer.
> Because we work with customer data, we cannot share the PDFs; in addition,
> they are signed, so we cannot even edit them.
> *Description of the problem:*
> By opening the PDF in Adobe Reader we can see several cases like the
> following screenshot:
> !image-2022-10-18-15-53-06-512.png|width=221,height=211!
> Visually, the text appears to have spaces between words, but if we copy the
> text from Adobe Reader and paste it into a text editor, there are no extra
> spaces.
> The following is the output that PDFBox generates when doing text
> extraction:
> {code:java}
> Da te
> In v oice number
> Ou r r eference
> You r reference
> Con tact person{code}
> (!) *Important note: this behavior is present in all the versions of PDFBox.*
> *Analysis:*
> By downloading the PDFBox 2.0.27 source code (this was also checked in
> 2.0.26, 2.0.25 and 2.0.24) and testing/debugging, we found that the method
> _*writePage()* inside *PDFTextStripper.java*_ declares a list of objects:
> {code:java}
> List<LineItem> line = new ArrayList<LineItem>();{code}
> The code subsequently adds elements to this list:
> {code:java}
> line.add(LineItem.getWordSeparator());
> .
> .
> .
> line.add(new LineItem(position));{code}
>
> And at some point it passes the list as a parameter into the following
> statement:
> {code:java}
> writeLine(normalize(line));{code}
> (!) *The important point about this list called "line" is that "LineItem"
> objects holding NULL values are somehow inserted into it, and these values
> are at some point interpreted as "blank spaces", causing the behavior
> described above.*
> Here is a screenshot of how it is shown in the debugger:
> !image-2022-10-18-16-23-00-123.png|width=621,height=195!
> !image-2022-10-18-16-26-15-001.png|width=620,height=431!
>
> We tried to find a method that manipulates this list and that we can
> override, but all of the methods that modify or access the list are
> protected.
>
> (!) *This is an example of how it is displayed in the PDF Debugger:*
> {code:java}
> q
> 94.525 545.32 141 11.2 re
> W*
> n
> BT
> /F3 8.8 Tf
> 1 0 0 1 99.325 547.72 Tm
> 0 g
> 0 G
> [ (D) 22 (a) -131 (t) -109 (e) ] TJ
> ET
> Q
> q
> 94.525 530.9 141 11.225 re
> W*
> n
> BT
> /F3 8.8 Tf
> 1 0 0 1 99.325 533.3 Tm
> 0 G
> [ (I) 26 (n) -135 (v) -229 (o) -5 (i) 20 (ce) -62 ( ) 59 (n) -44 (u)
> 30 (m) -27 (b) -75 (e) 28 (r) ] TJ
> ET
> Q
> q
> 94.525 516.5 141 11.2 re
> W*
> n
> BT
> /F3 8.8 Tf
> 1 0 0 1 99.325 519.7 Tm
> 0 G
> [ (O) -73 (u) -151 (r) -44 ( ) 59 (r) -134 (e) 28 (f) -38 (e) 28 (r)
> -44 (e) 28 (n) -44 (ce) ] TJ
> ET
> Q{code}
>
>
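
For reference, the numbers inside the {{TJ}} arrays above can be converted into horizontal offsets. The following is a minimal sketch of that arithmetic, assuming the usual PDF text model in which each number is given in thousandths of a text space unit and is subtracted from the current position, scaled by the font size and the horizontal scaling. The class and method names are hypothetical, and a horizontal scaling of 1.0 is assumed because the stream shows no {{Tz}} operator.

```java
public class TjAdjustment {

    /**
     * Horizontal displacement caused by one number in a TJ array.
     * A negative number moves the next glyph to the right (wider gap),
     * a positive number moves it to the left (tighter spacing).
     */
    static double displacement(double tjNumber, double fontSize, double horizontalScaling) {
        return -tjNumber / 1000.0 * fontSize * horizontalScaling;
    }

    public static void main(String[] args) {
        double fontSize = 8.8;              // from "/F3 8.8 Tf" in the stream above
        double[] adjustments = {22, -131, -109}; // from "[ (D) 22 (a) -131 (t) -109 (e) ] TJ"
        for (double adjustment : adjustments) {
            System.out.println(adjustment + " -> " + displacement(adjustment, fontSize, 1.0));
        }
    }
}
```

For the "Date" array this gives a shift of roughly 1.15 units to the right between "a" and "t" and roughly 0.96 units between "t" and "e". Whether such a shift is large enough to be treated as a word gap depends on the glyph and space widths of font {{F3}}, which are not part of the excerpt.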