[
https://issues.apache.org/jira/browse/PDFBOX-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13829785#comment-13829785
]
Marc Teutelink commented on PDFBOX-1783:
----------------------------------------
We tried this file with the Apache PDFBox command-line tool
"org.apache.pdfbox.PDFBox" using the "ExtractText" option and got the same
binary-looking output. However, the Linux tool pdftotext extracts the text
without problems, so the file does not appear to be corrupt.
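
For anyone who wants to compare the two outputs without eyeballing them, here
is a rough sketch of a check that flags a text file as probably garbled. It is
only a heuristic under stated assumptions: the class name, the method names
and the 10% threshold are made up for illustration, and it treats control
characters, private-use or unassigned code points and U+FFFD as "binary
stuff", which may not match what this particular PDF actually produces. It
uses only the Java standard library and does not call PDFBox.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox or Tika.
public class ExtractionSanityCheck {

    // Returns the fraction of code points that are unlikely to occur in
    // ordinary extracted text (assumption: control, private-use, unassigned
    // and U+FFFD count as suspicious; tab, CR and LF do not).
    static double suspiciousRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int suspicious = 0;
        int total = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            boolean ordinaryWhitespace = cp == '\n' || cp == '\r' || cp == '\t';
            int type = Character.getType(cp);
            boolean odd = type == Character.CONTROL
                    || type == Character.PRIVATE_USE
                    || type == Character.UNASSIGNED
                    || cp == 0xFFFD;
            if (odd && !ordinaryWhitespace) {
                suspicious++;
            }
        }
        return (double) suspicious / total;
    }

    // The threshold is an arbitrary cut-off chosen by the caller.
    static boolean looksGarbled(String text, double threshold) {
        return suspiciousRatio(text) > threshold;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ExtractionSanityCheck <extracted-text-file>");
            return;
        }
        // Bytes that are not valid UTF-8 decode to U+FFFD and count as suspicious.
        String text = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        System.out.printf("suspicious ratio: %.3f%n", suspiciousRatio(text));
        System.out.println(looksGarbled(text, 0.10) ? "probably garbled" : "looks like text");
    }
}
```

Running it once on the text file written by PDFBox and once on the one written
by pdftotext should make the difference visible if the bad output really
consists of such characters. If PDFBox instead emits printable but wrong
characters, this check will not catch it.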
> PdfBox extracts weird signs instead of text
> -------------------------------------------
>
> Key: PDFBOX-1783
> URL: https://issues.apache.org/jira/browse/PDFBOX-1783
> Project: PDFBox
> Issue Type: Bug
> Components: PDFReader
> Affects Versions: 1.8.2
> Environment: Linux, MacOSX
> Reporter: Marc Teutelink
> Labels: patch
> Attachments: gaatfout.pdf,
> plain_text_tika_output_from_gaat_fout_pdf.txt,
> structured_text_tika_output_from_gaat_fout_pdf.xml
>
>
> PDFBox extracts completely bogus text from the attached PDF. I discovered
> this when using Tika, so I have linked the corresponding TIKA Jira issue to
> this issue as well.
--
This message was sent by Atlassian JIRA
(v6.1#6144)