[jira] Commented: (PDFBOX-957) Text extraction using ExtractText (pdf file is input file) generates some weired characters

JIRA Tue, 08 Feb 2011 11:05:25 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992109#comment-12992109
 ]


Andreas Lehmkühler commented on PDFBOX-957:
-------------------------------------------

I'm afraid the text can't be extracted from the given PDFs. Both use fonts 
with a non-human-readable encoding, and since no mapping to readable 
characters is included, you'll get rubbish instead of the text. Even Acrobat 
Reader can't extract the text.
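
To illustrate the point, here is a minimal toy sketch (not PDFBox code) of why 
a missing mapping leads to rubbish. The font, its glyph codes, and the names 
CUSTOM_ENCODING and extract_text are hypothetical, made up for illustration. 
The assumption is that a font assigns arbitrary codes to its glyphs: a viewer 
can still draw the right shapes, but an extractor only sees the codes and 
needs a code-to-character table to turn them back into text.

```python
# Toy model: a hypothetical font that assigns arbitrary codes to its glyphs.
# The renderer knows which shape each code draws, so the page looks fine.
CUSTOM_ENCODING = {0x21: "H", 0x22: "e", 0x23: "l", 0x24: "o"}

# Glyph codes as they might appear in the page content for the word "Hello".
content_bytes = bytes([0x21, 0x22, 0x23, 0x23, 0x24])


def extract_text(data, to_unicode=None):
    """Turn glyph codes into text.

    With a code-to-character table the text is recovered. Without one the
    extractor can only guess, here by treating each code as Latin-1.
    """
    if to_unicode is None:
        return data.decode("latin-1")
    # Unknown codes become U+FFFD, the Unicode replacement character.
    return "".join(to_unicode.get(code, "\ufffd") for code in data)


with_mapping = extract_text(content_bytes, CUSTOM_ENCODING)
without_mapping = extract_text(content_bytes)

print(with_mapping)     # Hello
print(without_mapping)  # !"##$
```

In this sketch the same bytes yield readable text only when the table is 
available, which is the situation described for the attached files: the 
table is not in the document, so no extractor can recover it.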

> Text extraction using ExtractText (pdf file is input file) generates some 
> weired characters
> -------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-957
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-957
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.4.0
>         Environment: Windows 7
>            Reporter: Ashok Chigullapally
>            Priority: Critical
>              Labels: pdfbox, textExtraction
>         Attachments: Resume1.pdf, Resume2.pdf
>
>
> When I tried to extract text from a PDF document, it generated gibberish 
> text. 
> ExtractText.exe "\Jobvite\Resumes\Resume-Boston.pdf Resume-Boston.txt
> I will provide the PDF documents when requested; I could not find a way to 
> include attachments.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
