[jira] [Commented] (PDFBOX-3796) Content of different table cells concatenated on text extraction in some cases

Tilman Hausherr (JIRA) Fri, 19 May 2017 10:57:39 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017756#comment-16017756
 ]


Tilman Hausherr commented on PDFBOX-3796:
-----------------------------------------

Sadly I can't answer this one. Not because I "don't know", but because when I 
tested text extraction differences from 1.8 to 2.0 a year ago I came to the 
conclusion that there is no perfect solution.

In some cases, setting the sort option is wrong, e.g. for certain PDFs with 
columns where the content stream is in reading order.
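
To illustrate the column case, here is a minimal sketch in plain Java. It does 
not use PDFBox: Fragment, twoColumnPage() and extract() are hypothetical names 
made up for this example, and the sort is a naive top-to-bottom, left-to-right 
ordering, assumed here as a stand-in for what a position sort (presumably 
PDFTextStripper.setSortByPosition(true)) does in spirit. The real 
implementation is more involved.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SortOrderDemo {
    // Hypothetical positioned text fragment; not a PDFBox class.
    static final class Fragment {
        final double x;   // horizontal position
        final double y;   // vertical position, growing downward
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // A two-column page whose content stream is already in reading order:
    // the whole left column first, then the whole right column.
    static List<Fragment> twoColumnPage() {
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment(50, 100, "A1"));
        page.add(new Fragment(50, 120, "A2"));
        page.add(new Fragment(300, 100, "B1"));
        page.add(new Fragment(300, 120, "B2"));
        return page;
    }

    // Joins fragments either in stream order or after a naive position sort.
    static String extract(List<Fragment> streamOrder, boolean sortByPosition) {
        List<Fragment> ordered = new ArrayList<>(streamOrder);
        if (sortByPosition) {
            ordered.sort(Comparator
                    .comparingDouble((Fragment f) -> f.y)
                    .thenComparingDouble(f -> f.x));
        }
        StringBuilder sb = new StringBuilder();
        for (Fragment f : ordered) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Fragment> page = twoColumnPage();
        System.out.println(extract(page, false)); // A1 A2 B1 B2
        System.out.println(extract(page, true));  // A1 B1 A2 B2
    }
}
```

In stream order the columns stay intact; with the position sort, lines from 
the two columns are interleaved row by row. A table whose content stream is 
not in reading order has the opposite problem, which is why neither setting 
works for every PDF.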

> Content of different table cells concatenated on text extraction in some cases
> ------------------------------------------------------------------------------
>
>                 Key: PDFBOX-3796
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3796
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.7, 3.0.0
>            Reporter: Yauheni Salopiy
>              Labels: table, text_extraction
>         Attachments: fdl_relpub_foi_dailyre0313172017_2.0.6.txt, 
> fdl_relpub_foi_dailyre0313172017_3.0.txt, fdl_relpub_foi_dailyre0313172017.pdf
>
>
> The content of different table cells is concatenated on text extraction in 
> some cases.
> Please see the attachments for one of the problematic PDF files and the plain 
> text files extracted by PDFBox 2.0.6 and 3.0.0 (trunk).
> Snippet from the extracted text containing the concatenated text content of 
> different cells:
>  INDIVIDUAL REC{color:#d04437}SJ{color}eanette 
> Bleckle{color:#d04437}y0{color}3/17/2017/



