[ 
https://issues.apache.org/jira/browse/PDFBOX-3715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15924627#comment-15924627
 ] 

Tilman Hausherr commented on PDFBOX-3715:
-----------------------------------------

Running without sort:

1.8:
{code}
String[213.27559,133.08313 fs=1.0 xscale=9.0 height=5.913 space=2.25 width=2.25] 
...
String[213.33325,133.11017 fs=1.0 xscale=9.0 height=5.913 space=2.25 width=2.25] 
{code}


2.0:
{code}
String[213.27559,133.08313 fs=1.0 xscale=9.0 height=5.9040003 space=2.25 width=2.25] 
{code}


Ok, so I can reproduce the effect by disabling sort. 

2.0 with setSuppressDuplicateOverlappingText(false):
{code}
String[213.27559,133.08313 fs=1.0 xscale=9.0 height=5.9040003 space=2.25 width=2.25] 
...
String[213.33325,133.11017 fs=1.0 xscale=9.0 height=5.9040003 space=2.25 width=2.25] 
{code}

So your complaint seems to be that the character wasn't suppressed in 1.8 
despite suppressDuplicateOverlappingText==true (default) but is suppressed in 
2.0.

The reason is that the 1.8 PrintTextLocations example never calls 
processTextPosition(), so the suppression setting has no effect there. The 2.0 
example does call it. I tried changing 1.8 a bit, but then I thought this isn't 
about 1.8, so why bother...

You can get the effect you had in 1.8 by calling 
{{setSuppressDuplicateOverlappingText(false)}}. Does this solve your problem or 
not? Can I close the issue? IMO the 2.0 example works better because now the 
setting has an effect.
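
To illustrate why the second space in the output above disappears when suppression is on, here is a minimal, self-contained sketch. It does not use PDFBox and is not the library's code: the {{Glyph}} class, the {{filter()}} method and the tolerance of one third of the glyph width are assumptions made for illustration only. The coordinates and width are taken from the output above.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of duplicate-overlap suppression; NOT PDFBox's implementation.
final class OverlapSuppressionSketch {

    /** Hypothetical stand-in for a text position: one character, its origin and width. */
    static final class Glyph {
        final String text;
        final float x;
        final float y;
        final float width;

        Glyph(String text, float x, float y, float width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Returns the glyphs that survive. With suppress == true, a glyph is dropped
     * when an already kept glyph with the same text lies within the tolerance
     * in both x and y. With suppress == false, everything is kept.
     */
    static List<Glyph> filter(List<Glyph> glyphs, boolean suppress) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            boolean duplicate = false;
            if (suppress) {
                // Assumed threshold: one third of the glyph width.
                float tolerance = g.width / 3.0f;
                for (Glyph k : kept) {
                    if (k.text.equals(g.text)
                            && Math.abs(k.x - g.x) < tolerance
                            && Math.abs(k.y - g.y) < tolerance) {
                        duplicate = true;
                        break;
                    }
                }
            }
            if (!duplicate) {
                kept.add(g);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The two spaces reported above, in the order they occur without sorting.
        List<Glyph> spaces = new ArrayList<>();
        spaces.add(new Glyph(" ", 213.27559f, 133.08313f, 2.25f));
        spaces.add(new Glyph(" ", 213.33325f, 133.11017f, 2.25f));

        System.out.println(filter(spaces, true).size());  // prints 1
        System.out.println(filter(spaces, false).size()); // prints 2
    }
}
```

Under these assumptions the two spaces are about 0.06 apart in x and 0.03 in y, well inside the 0.75 tolerance, so the second one is treated as a duplicate of the first. With suppression off, both are kept, which matches the 1.8 output.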



> Text Stripper regression in 2.0
> -------------------------------
>
>                 Key: PDFBOX-3715
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3715
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: Roman
>         Attachments: WindowsPhone7.pdf_page1_qdf.pdf
>
>
> After migrating from 1.8 to 2.0, we noticed that some spaces have disappeared. 
> Please see the attached PDF, in which the missing spaces are shown as blue boxes. 
> Those spaces WERE present in the 1.8 version.
> Our app subclasses *PDFTextStripper*, overrides the *writePage()* method, 
> and uses the *charactersByArticle* property, which is actually a list of all 
> *TextPosition* objects existing for every character in the document.
> Some trailing spaces have disappeared from it, even though those spaces 
> are explicitly declared in the PDF. For example, this piece of the 
> attached PDF contains the space right after the word "contents":
> {code}
> [( the content)-7(s )-2(of t)...]TJ
> {code}
> PS
>   I found that this bug occurs only when *sortExtractedTextByPosition* mode 
> is set to *false*. The spaces are removed from inside the text by the 
> *suppressDuplicateOverlappingText* feature because there are other spaces 
> with almost the same coordinates at the beginning of the document. When sorting, 
> those spaces are moved inside the text, in place of the removed ones. The remaining 
> question is how this worked in 1.8, and whether PDFBox can be enhanced by adding 
> some "backward compatibility" mode so that we can avoid such regressions?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
