[
https://issues.apache.org/jira/browse/PDFBOX-957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992109#comment-12992109
]
Andreas Lehmkühler commented on PDFBOX-957:
-------------------------------------------
I'm afraid the text can't be extracted from the given PDFs. Both use fonts
with a custom encoding that is not human-readable, and because the files don't
include any mapping from character codes to Unicode, you'll get rubbish instead
of the text. Even Acrobat Reader can't extract the text.
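To illustrate the effect, here is a minimal toy sketch in plain Java. It is not
PDFBox code: the class name, the character codes and the mapping are all made up
for the example. It only shows why the same codes give readable text when a
code-to-Unicode mapping is available and rubbish when the extractor has to guess
an encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Toy model only (hypothetical names, not the PDFBox API).
class GlyphDecodingSketch {

    // Decodes one-byte character codes. With a code-to-Unicode map, each code
    // is translated to its text. Without one, the only option is to guess an
    // encoding (Latin-1 here), which yields rubbish for a custom encoding.
    static String decode(byte[] codes, Map<Integer, String> toUnicode) {
        if (toUnicode == null) {
            return new String(codes, StandardCharsets.ISO_8859_1);
        }
        StringBuilder text = new StringBuilder();
        for (byte b : codes) {
            int code = b & 0xFF;
            // Codes missing from the map become the Unicode replacement character.
            text.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Assumed example: a subset font that numbers its glyphs arbitrarily.
        byte[] codes = {0x21, 0x22, 0x23, 0x23, 0x24}; // the glyphs drawn are "Hello"
        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(0x21, "H");
        toUnicode.put(0x22, "e");
        toUnicode.put(0x23, "l");
        toUnicode.put(0x24, "o");

        System.out.println(decode(codes, toUnicode)); // prints Hello
        System.out.println(decode(codes, null));      // prints !"##$
    }
}
```

The page still renders correctly in a viewer because the glyph shapes are stored
in the font itself; only the way back from codes to characters is missing.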
> Text extraction using ExtractText (pdf file is input file) generates some
> weird characters
> -------------------------------------------------------------------------------------------
>
> Key: PDFBOX-957
> URL: https://issues.apache.org/jira/browse/PDFBOX-957
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.4.0
> Environment: Windows 7
> Reporter: Ashok Chigullapally
> Priority: Critical
> Labels: pdfbox, textExtraction
> Attachments: Resume1.pdf, Resume2.pdf
>
>
> When I tried to extract text from a PDF document, it generated
> gibberish text.
> ExtractText.exe "\Jobvite\Resumes\Resume-Boston.pdf Resume-Boston.txt
> I will provide the PDF documents when requested; I could not find a way to
> include attachments.