[jira] [Closed] (PDFBOX-3719) pdfbox parses spaces as tabs

Tilman Hausherr (JIRA) Thu, 16 Mar 2017 13:55:36 -0700

     [ 
https://issues.apache.org/jira/browse/PDFBOX-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Tilman Hausherr closed PDFBOX-3719.
-----------------------------------
    Resolution: Not A Problem

Closing as "not a problem". You can still comment and/or reopen.

> pdfbox parses spaces as tabs 
> -----------------------------
>
>                 Key: PDFBOX-3719
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3719
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.13
>            Reporter: Ahmed Eltayeb
>         Attachments: DummyDoc.docx, DummyDoc.pdf
>
>
> I converted this PDF from the attached Word document "DummyDoc.docx".
> Then I used PDFBox 1.8 to extract the text:
> java -jar pdfbox-app-1.8.13.jar ExtractText "DummyDoc.pdf" txt.txt
> The generated output is:
> Dummy document        for     tag     extraction      
>       
> Section       1       
>       
> \\DummyTagOne_01  
> This  is      text    body    one     
>       
> \\DummyTagOne_02  
> This  is      text    body    two     
>       
> Section       2       
> \\DummyTagTwo_01  
> This  is      text    body    three   
>       
> \\DummyTagTwo_02  
> This  is      text    body    four    
>       
> \\DummyTagTwo_03  
> This  is      text    body    five    
> As you can see, the output is "This  is      text    body    one     " instead of
> "This is text body one     ", and so on.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

