[ 
https://issues.apache.org/jira/browse/PDFBOX-3796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017756#comment-16017756
 ] 

Tilman Hausherr commented on PDFBOX-3796:
-----------------------------------------

Sadly I can't answer this one. Not because I "don't know", but because when I 
tested text extraction differences from 1.8 to 2.0 a year ago I came to the 
conclusion that there is no perfect solution.

In some cases, setting the sort option is wrong, e.g. for certain PDFs with 
columns where the content stream is in reading order.
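
To illustrate the column case, here is a toy sketch that does not use PDFBox at 
all. The Chunk class, the coordinates and the plain "sort by y, then by x" rule 
are assumptions made up for this example. They are not PDFBox's actual 
algorithm, which is more involved, but they show how position sorting (what 
PDFTextStripper.setSortByPosition(true) presumably stands for here) can 
interleave two columns that the content stream already had in reading order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ColumnSortDemo {
    /** Hypothetical text fragment: position of its line, y growing downward. */
    static final class Chunk {
        final double x;
        final double y;
        final String text;

        Chunk(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Two-column page whose content stream is already in reading order. */
    static List<Chunk> contentStream() {
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk(50, 100, "Left 1"));
        chunks.add(new Chunk(50, 120, "Left 2"));
        chunks.add(new Chunk(300, 100, "Right 1"));
        chunks.add(new Chunk(300, 120, "Right 2"));
        return chunks;
    }

    /** Joins the chunk texts in list order, separated by " | ". */
    static String join(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            if (sb.length() > 0) {
                sb.append(" | ");
            }
            sb.append(c.text);
        }
        return sb.toString();
    }

    /** Extraction without sorting: keeps the content stream order. */
    static String inStreamOrder() {
        return join(contentStream());
    }

    /** Naive position sort: top to bottom, then left to right. */
    static String sortedByPosition() {
        List<Chunk> sorted = new ArrayList<>(contentStream());
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y)
                .thenComparingDouble(c -> c.x));
        return join(sorted);
    }

    public static void main(String[] args) {
        System.out.println(inStreamOrder());    // Left 1 | Left 2 | Right 1 | Right 2
        System.out.println(sortedByPosition()); // Left 1 | Right 1 | Left 2 | Right 2
    }
}
```

For a table the situation is reversed: there the row-by-row order produced by 
sorting is usually what one wants, which is why neither setting works for every 
file.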

> Content of different table cells concatenated on text extraction in some cases
> ------------------------------------------------------------------------------
>
>                 Key: PDFBOX-3796
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3796
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.7, 3.0.0
>            Reporter: Yauheni Salopiy
>              Labels: table, text_extraction
>         Attachments: fdl_relpub_foi_dailyre0313172017_2.0.6.txt, 
> fdl_relpub_foi_dailyre0313172017_3.0.txt, fdl_relpub_foi_dailyre0313172017.pdf
>
>
> Content of different table cells is concatenated on text extraction in some 
> cases.
> Please see the attachments for one of the problematic PDF files and the plain 
> text files extracted by PDFBox 2.0.6 and 3.0.0 (trunk).
> Snippet from the extracted text containing the concatenated text content of 
> different cells:
>  INDIVIDUAL REC{color:#d04437}SJ{color}eanette 
> Bleckle{color:#d04437}y0{color}3/17/2017/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
