[ 
https://issues.apache.org/jira/browse/PDFBOX-3715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15907776#comment-15907776
 ] 

Roman commented on PDFBOX-3715:
-------------------------------

[~tilman] In the PrintTextLocations example, *sortExtractedTextByPosition* mode 
is turned on. To reproduce the problem, please turn it off. The spaces are 
removed from the text by the *suppressDuplicateOverlappingText* feature (this 
happens only in 2.0, not in 1.8). When *sortExtractedTextByPosition* mode is on, 
the spaces from the beginning of the PDF (which are the actual reason the spaces 
in the middle of the text are removed) are moved into the place of the removed 
spaces, so the output appears "fixed".
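
To make the mechanism described above easier to follow, here is a minimal, self-contained sketch in plain Java. It is a toy model only and not PDFBox's actual implementation: the `Glyph` class, the `suppressDuplicates` and `render` helpers, the coordinates, and the 0.5 tolerance are all hypothetical names and values chosen for illustration. It shows how a stray space early in the content stream can cause a later, legitimate space at nearly the same coordinates to be dropped, and how sorting by position then hides the loss.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model (NOT PDFBox code): all names and numbers here are hypothetical.
class DuplicateSuppressionSketch {

    // Stand-in for a text position: one piece of text plus its coordinates.
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Keep a glyph only if no earlier kept glyph has the same text
    // within `tolerance` on both axes (assumed duplicate rule).
    static List<Glyph> suppressDuplicates(List<Glyph> glyphs, float tolerance) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            boolean duplicate = false;
            for (Glyph k : kept) {
                if (k.text.equals(g.text)
                        && Math.abs(k.x - g.x) < tolerance
                        && Math.abs(k.y - g.y) < tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(g);
            }
        }
        return kept;
    }

    // Concatenate glyph text, optionally sorting by (y, x) first.
    static String render(List<Glyph> glyphs, boolean sortByPosition) {
        List<Glyph> out = new ArrayList<>(glyphs);
        if (sortByPosition) {
            out.sort(Comparator.comparingDouble((Glyph g) -> g.y)
                    .thenComparingDouble((Glyph g) -> g.x));
        }
        StringBuilder sb = new StringBuilder();
        for (Glyph g : out) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    // Stream order: a stray space first, then "a", a real space, "b".
    static String demo(boolean sortByPosition) {
        List<Glyph> stream = new ArrayList<>();
        stream.add(new Glyph(" ", 12.0f, 100.0f)); // stray space at start of stream
        stream.add(new Glyph("a", 10.0f, 100.0f));
        stream.add(new Glyph(" ", 12.2f, 100.0f)); // real space, nearly same position
        stream.add(new Glyph("b", 14.0f, 100.0f));
        return render(suppressDuplicates(stream, 0.5f), sortByPosition);
    }

    public static void main(String[] args) {
        System.out.println("[" + demo(false) + "]"); // prints "[ ab]"
        System.out.println("[" + demo(true) + "]");  // prints "[a b]"
    }
}
```

In this model the unsorted output loses the space between "a" and "b", while the sorted output looks correct only because the stray space from the start of the stream lands where the removed one used to be.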

> Text Stripper regression in 2.0
> -------------------------------
>
>                 Key: PDFBOX-3715
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3715
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: Roman
>         Attachments: WindowsPhone7.pdf_page1_qdf.pdf
>
>
> When we migrated from 1.8 to 2.0, we noticed that some spaces had disappeared. 
> Please see the attached PDF. The missing spaces are shown as blue boxes in it. 
> Those spaces WERE present in version 1.8.
> Our app subclasses *PDFTextStripper*, overrides the *writePage()* method, 
> and uses the *charactersByArticle* property, which is actually a list of all 
> *TextPosition* objects that exist for every character in the document.
> Some trailing spaces are missing from it. At the same time, those spaces 
> are present in the PDF via explicit declaration. For example, this piece of 
> the attached PDF contains the space right after the word "contents":
> {code}
> [( the content)-7(s )-2(of t)...]TJ
> {code}
> PS
>   I found that this bug occurs only when *sortExtractedTextByPosition* mode 
> is set to *false*. The spaces are removed from inside the text by the 
> *suppressDuplicateOverlappingText* feature, because there are other spaces 
> with almost the same coordinates at the beginning of the document. When 
> sorting, those spaces are moved inside the text, in place of the removed ones. 
> The remaining question is how this worked in 1.8, and whether PDFBox can be 
> enhanced by adding some "backward compatibility" mode so that we can avoid 
> such regressions.


