Ahmed Eltayeb created PDFBOX-3719:
-------------------------------------

             Summary: pdfbox reads spaces as tabs 
                 Key: PDFBOX-3719
                 URL: https://issues.apache.org/jira/browse/PDFBOX-3719
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
    Affects Versions: 1.8.13
            Reporter: Ahmed Eltayeb
         Attachments: DummyDoc.docx, DummyDoc.pdf

I converted the attached Word document "DummyDoc.docx" to the PDF "DummyDoc.pdf".

Then I used PDFBox 1.8.13 to extract the text:
java -jar pdfbox-app-1.8.13.jar ExtractText "DummyDoc.pdf" txt.txt

The generated output is:

Dummy   document        for     tag     extraction      
        
Section 1       
        
\\DummyTagOne_01  
This    is      text    body    one     
        
\\DummyTagOne_02  
This    is      text    body    two     
        
Section 2       
\\DummyTagTwo_01  
This    is      text    body    three   
        
\\DummyTagTwo_02  
This    is      text    body    four    
        
\\DummyTagTwo_03  
This    is      text    body    five    


As you can see, the output contains "This    is      text    body    one     " instead of
"This is text body one     ", and so on for the other lines: the spaces between words
are extracted as tabs.
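
Until the extraction itself is fixed, one possible workaround is to post-process the extracted text. The sketch below is not part of PDFBox and does not address the parsing problem; the class name TabNormalizer and its methods are hypothetical. It assumes that every run of whitespace containing a tab should become a single space, and that trailing whitespace on each line can be dropped.

```java
// Hypothetical helper, not a PDFBox API: cleans up text produced by ExtractText.
final class TabNormalizer {

    // Replace each whitespace run that contains at least one tab with a single
    // space, then strip trailing whitespace (assumption: it carries no meaning).
    static String normalizeLine(String line) {
        String collapsed = line.replaceAll("[ \\t]*\\t[ \\t]*", " ");
        return collapsed.replaceAll("\\s+$", "");
    }

    // Apply normalizeLine to every line, joining the result with '\n'.
    static String normalize(String text) {
        String[] lines = text.split("\\r?\\n", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                out.append('\n');
            }
            out.append(normalizeLine(lines[i]));
        }
        return out.toString();
    }
}
```

The contents of txt.txt could be read into a String and passed to TabNormalizer.normalize. Ordinary spaces between words are left untouched, so lines such as "Section 1" keep their spacing.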



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
