[ 
https://issues.apache.org/jira/browse/PDFBOX-2817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed PDFBOX-2817.
-----------------------------------
    Resolution: Won't Fix

Here's the content stream:
{code}
  q
    72.51 518 93.6 201.52 re
    W*
    n
    BT
      /F1 11.04 Tf
      1 0 0 1 77.664 561.46 Tm
      0 g
      0 G
      [ (CA) 3 (P) -4 (ITALIZ) 12 (ED) -5 (.E) 12 (M) -3 (A) ] TJ
    ET
  Q
  q
    72.51 518 93.6 201.52 re
    W*
    n
    BT
      /F1 11.04 Tf
      1 0 0 1 77.664 548.02 Tm
      0 g
      0 G
      [ (IL@ad) 5 (d) 3 (ress.c) 11 (o) -5 (m) 6 (  ) ] TJ
    ET
  Q
{code}
PDFBox works as intended, and there are no plans to improve this.

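"CAPITALIZED.EMA" is separated from "IL@address.com". The two strings are on
different lines: each sits in its own BT/ET block, and the Tm operators place
them at y = 561.46 and y = 548.02. Acrobat is probably using some heuristics
to join them.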
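
If the joined form is needed, the extracted text can be post-processed by the
caller. Below is a minimal sketch of one such heuristic in plain Java. It is
not part of PDFBox and is not how Acrobat works; the class name, the method
name and the loose e-mail pattern are all assumptions made for illustration.
The input strings are the two TJ arrays from the content stream above with
the kerning offsets dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class LineJoinSketch {
    // Deliberately loose e-mail shape; an assumption for illustration only.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    /**
     * Joins a line with the following line when the first line alone is not
     * an e-mail address but the two lines concatenated are.
     * All returned lines are trimmed.
     */
    public static List<String> joinSplitEmails(List<String> lines) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < lines.size()) {
            String current = lines.get(i).trim();
            if (i + 1 < lines.size()) {
                String joined = current + lines.get(i + 1).trim();
                if (!EMAIL.matcher(current).matches()
                        && EMAIL.matcher(joined).matches()) {
                    out.add(joined);
                    i += 2; // the next line has been consumed
                    continue;
                }
            }
            out.add(current);
            i++;
        }
        return out;
    }

    public static void main(String[] args) {
        // The two lines as a text stripper would return them.
        List<String> extracted = Arrays.asList("CAPITALIZED.EMA", "IL@address.com  ");
        System.out.println(joinSplitEmails(extracted));
    }
}
```

A rule like this can also join lines that were never meant to be one token
(for example a one-word line followed by a line that starts with an address),
so it is only suitable where such false positives are acceptable.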

> Extra whitespace produced while extracting bodycontent using PDFTextStripper
> ----------------------------------------------------------------------------
>
>                 Key: PDFBOX-2817
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2817
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.6
>            Reporter: cheehoo
>         Attachments: File1.pdf, PDFBox-Whitespace.PNG, PDFBoxTest.java
>
>
> I wrote a sample Java program that parses a PDF document; however, it seems
> to produce extra whitespace (e.g. CAPITALIZED.EMA IL@address.com) in the
> returned string. The sample program code and the PDF document are attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
