[ 
https://issues.apache.org/jira/browse/PDFBOX-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Eltayeb updated PDFBOX-3719:
----------------------------------
    Summary: pdfbox parses spaces as tabs   (was: pdfbox reads spaces as tabs )

> pdfbox parses spaces as tabs 
> -----------------------------
>
>                 Key: PDFBOX-3719
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3719
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.8.13
>            Reporter: Ahmed Eltayeb
>         Attachments: DummyDoc.docx, DummyDoc.pdf
>
>
> I generated the attached PDF "DummyDoc.pdf" from the attached Word document
> "DummyDoc.docx", then used PDFBox 1.8.13 to extract its text:
> java -jar pdfbox-app-1.8.13.jar ExtractText "DummyDoc.pdf" txt.txt
> The generated output is:
> Dummy document        for     tag     extraction      
>       
> Section       1       
>       
> \\DummyTagOne_01  
> This  is      text    body    one     
>       
> \\DummyTagOne_02  
> This  is      text    body    two     
>       
> Section       2       
> \\DummyTagTwo_01  
> This  is      text    body    three   
>       
> \\DummyTagTwo_02  
> This  is      text    body    four    
>       
> \\DummyTagTwo_03  
> This  is      text    body    five    
> As you can see, the output contains "This  is      text    body    one     "
> instead of "This is text body one", and so on: the spaces between words are
> extracted as tabs.
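
Until the extraction itself is fixed, the output shown above could be cleaned up after the fact. The following is only a minimal workaround sketch, not a change to PDFBox: it assumes the extracted text has already been read into a String (for example from txt.txt), and the class and method names (TabNormalizer, normalizeLine, normalize) are hypothetical. It collapses every run of tabs and spaces into a single space and trims each line, so it would also flatten any tabs that are genuinely present in a document.

```java
import java.util.regex.Pattern;

// Hypothetical post-processing helper; not part of PDFBox.
public class TabNormalizer {
    // Matches one or more consecutive spaces or tabs.
    private static final Pattern HORIZONTAL_WS = Pattern.compile("[ \\t]+");

    // Collapse runs of spaces/tabs to one space and trim the line ends.
    public static String normalizeLine(String line) {
        return HORIZONTAL_WS.matcher(line).replaceAll(" ").trim();
    }

    // Apply normalizeLine to every line, keeping the line structure.
    public static String normalize(String text) {
        String[] lines = text.split("\\r?\\n", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                out.append('\n');
            }
            out.append(normalizeLine(lines[i]));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample resembling the reported output, with tabs between words.
        String extracted = "This\tis\ttext\tbody\tone\t\n\t\n\\\\DummyTagOne_02  ";
        System.out.println(normalize(extracted));
    }
}
```

Lines that contain only a tab become empty lines, which keeps the blank-line separation between the tagged sections.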



