[ https://issues.apache.org/jira/browse/PDFBOX-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822384#comment-15822384 ]
Dan Dorazio commented on PDFBOX-3438:
-------------------------------------
Hi all -
I read the most recent response, from July 27, 2016, about a bug in
Distiller. However, I have a document created in 2006 that shows the same
symptom: text extraction runs, but the output is only garbage. Do you know
whether the Distiller bug referenced above could also have been present back
then?
We are performing the extraction with the latest version of Apache Tika
(1.14), which bundles and uses PDFBox 2.0.3. Unfortunately, I cannot share
the document because it contains sensitive information. I'd be interested in
the attached patch, but I'm not sure how I'd apply it, given that we use Tika.
I suppose I could try it outside of Tika and see whether the result improves.
Any other ideas for a workaround?
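For the comparison outside of Tika, one rough way to judge whether a result
"improves" is to score each extracted text file instead of eyeballing it. The
sketch below is only an assumption about what counts as garbage: the class
name ExtractionCheck and the method suspiciousRatio are hypothetical, and the
check only counts replacement characters, private-use or unassigned code
points, and stray control characters. It will not catch output that is wrong
but made of ordinary letters, which a missing Unicode mapping can also produce.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox or Tika.
final class ExtractionCheck {

    /** Returns the fraction of code points that look like extraction debris. */
    static double suspiciousRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;

            // Control characters other than ordinary whitespace.
            boolean control = Character.isISOControl(cp)
                    && cp != '\n' && cp != '\r' && cp != '\t';
            // U+FFFD, private-use, or unassigned code points.
            int type = Character.getType(cp);
            boolean debris = cp == 0xFFFD
                    || type == Character.PRIVATE_USE
                    || type == Character.UNASSIGNED;

            if (control || debris) {
                suspicious++;
            }
        }
        return (double) suspicious / total;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ExtractionCheck <extracted-text-file>");
            return;
        }
        // Assumes the extracted text was saved as UTF-8.
        String text = new String(
                Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        System.out.println("suspicious ratio: " + suspiciousRatio(text));
    }
}
```

Running it on the text produced through Tika and on the text produced by
PDFBox alone gives two numbers to compare; a lower ratio suggests cleaner
output, but it is no substitute for reading a sample of the text.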
Thanks,
Dan
> only garbage extracted, lots of warnings "No Unicode mapping..."
> ----------------------------------------------------------------
>
> Key: PDFBOX-3438
> URL: https://issues.apache.org/jira/browse/PDFBOX-3438
> Project: PDFBox
> Issue Type: Wish
> Components: Text extraction
> Affects Versions: 2.0.2
> Reporter: Oliver Steinau
> Attachments: PDFBOX-3438.diff, PDFBOX-3438.txt, test.pdf
>
>
> When I try to extract text from this PDF, I get lots of warnings "No Unicode
> mapping for ...", and as output I only get garbage.
> PDF file displays fine in Acrobat Reader, and pdftotext.exe will extract the
> text just fine.
> PDF file seems to have a Type-1 font embedded with a custom encoding.