[jira] [Commented] (PDFBOX-2991) Improper word concatenation when extracting pdf

Maruan Sahyoun (JIRA) Thu, 15 Oct 2015 23:41:07 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960246#comment-14960246
 ]


Maruan Sahyoun commented on PDFBOX-2991:
----------------------------------------

It would be good to keep both versions of the document for comparison, 
especially since the comments above no longer apply to the document that is 
currently available. In addition, it would help to look at the low-level 
content to figure out why the extraction has improved for one document but not 
for the other, i.e. although the content is visually similar, what low-level 
information leads to that rendering?
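
To illustrate why the low-level values matter, here is a minimal, 
self-contained sketch. It is not PDFBox's actual algorithm; the class 
LineBreakSketch, the Fragment type, and the yTolerance parameter are all 
hypothetical names made up for this example. It assumes an extractor decides 
on a line break by comparing baseline positions against a tolerance, so the 
same visual layout can come out with or without the break depending on the 
numbers involved.

```java
import java.util.Arrays;
import java.util.List;

public class LineBreakSketch {

    /** Hypothetical positioned text fragment: baseline origin plus its text. */
    static final class Fragment {
        final float x;
        final float y;
        final String text;

        Fragment(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Joins fragments in stream order, starting a new line whenever the
     * baseline differs from the previous fragment's by more than yTolerance.
     * This is an assumed heuristic for illustration only.
     */
    static String join(List<Fragment> fragments, float yTolerance) {
        StringBuilder out = new StringBuilder();
        Fragment previous = null;
        for (Fragment f : fragments) {
            if (previous != null && Math.abs(f.y - previous.y) > yTolerance) {
                out.append('\n');
            }
            out.append(f.text);
            previous = f;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Two fragments whose baselines are 12 units apart.
        List<Fragment> page = Arrays.asList(
                new Fragment(72f, 700f, "Software"),
                new Fragment(72f, 688f, "Engineer"));

        // Tolerance below the baseline distance: the line break is detected.
        System.out.println(join(page, 2f));
        // Tolerance above the baseline distance: the words run together.
        System.out.println(join(page, 20f));
    }
}
```

If the tolerance were derived from some low-level value (a font size, for 
instance), two documents that render the same could still produce different 
extracted text, which is why comparing the raw content of both versions 
would help.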

> Improper word concatenation when extracting pdf
> -----------------------------------------------
>
>                 Key: PDFBOX-2991
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2991
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Ben McCann
>         Attachments: sample-resume.pdf
>
>
> The code below outputs the text of a PDF. Words that are on different lines 
> are concatenated together:
>     PDDocument pdDoc = PDDocument.load(new File("sample-resume.pdf"));
>     StringWriter writer = new StringWriter();
>     new PDFTextStripper().writeText(pdDoc, writer);
>     pdDoc.close();
>     System.out.println(writer.toString());



