[ https://issues.apache.org/jira/browse/PDFBOX-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr closed PDFBOX-3719.
-----------------------------------
Resolution: Not A Problem
Closing as "not a problem". You can still comment and/or reopen.
> pdfbox parses spaces as tabs
> -----------------------------
>
> Key: PDFBOX-3719
> URL: https://issues.apache.org/jira/browse/PDFBOX-3719
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.13
> Reporter: Ahmed Eltayeb
> Attachments: DummyDoc.docx, DummyDoc.pdf
>
>
> I converted this PDF from the attached Word document "DummyDoc.docx".
> Then I used PDFBox 1.8 to extract the text:
> java -jar pdfbox-app-1.8.13.jar ExtractText "DummyDoc.pdf" txt.txt
> The generated output is:
> Dummy document for tag extraction
>
> Section 1
>
> \\DummyTagOne_01
> This is text body one
>
> \\DummyTagOne_02
> This is text body two
>
> Section 2
> \\DummyTagTwo_01
> This is text body three
>
> \\DummyTagTwo_02
> This is text body four
>
> \\DummyTagTwo_03
> This is text body five
> As you can see, the output contains tabs where spaces are expected,
> for example in "This is text body one", and so on.
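
One way to confirm whether the extracted output really contains tab characters is to scan txt.txt for U+0009. The sketch below is only an illustration using the Java standard library. It is not part of PDFBox, and the class name TabCheck, its two helper methods, the sample string, and the UTF-8 encoding of txt.txt are all assumptions. Replacing tabs after extraction is a possible workaround, not something this issue prescribes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not a PDFBox class.
public class TabCheck {

    // Counts the tab characters (U+0009) in the given text.
    static int countTabs(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\t') {
                count++;
            }
        }
        return count;
    }

    // Replaces every tab with a single space; all other characters are kept.
    static String tabsToSpaces(String text) {
        return text.replace('\t', ' ');
    }

    public static void main(String[] args) throws IOException {
        // With an argument, read that file (assumed to be UTF-8);
        // without one, fall back to a made-up sample line.
        String text = args.length > 0
                ? new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8)
                : "This is\ttext body one";
        System.out.println("tab characters found: " + countTabs(text));
        System.out.println(tabsToSpaces(text));
    }
}
```

Running it as `java TabCheck txt.txt` on the ExtractText output reports how many tabs the file contains and prints a copy with tabs replaced by spaces.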