[jira] [Commented] (PDFBOX-3438) only garbage extracted, lots of warnings "No Unicode mapping..."

Dan Dorazio (JIRA) Fri, 13 Jan 2017 13:38:13 -0800

    [ https://issues.apache.org/jira/browse/PDFBOX-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822384#comment-15822384 ]


Dan Dorazio commented on PDFBOX-3438:
-------------------------------------

Hi all - 

I read the most recent response, from 7/27/16, about a bug in 
Distiller. However, I have a document created in '06 that shows the same 
symptom: text extraction runs, but the output is only garbage. Do you know 
whether the Distiller bug referenced above could already have existed at that 
time?

We are performing the extraction with the latest version of Apache Tika 
(1.14), which includes (and uses) PDFBox 2.0.3. Unfortunately, I cannot share 
the document because it contains sensitive information. I'd be interested in 
the attached patch, but I'm not sure how I'd apply it, given our use of Tika. 
I suppose I could try it outside of Tika and see whether the result improves. 
Any other ideas for a workaround?
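
For comparing a run inside Tika against a run outside it, one rough way to 
judge whether "the result improves" is to score each extracted string for 
obviously broken characters. The sketch below is only an illustration: the 
class and method names (GarbageScore, suspiciousRatio, isSuspicious) are 
hypothetical, it uses nothing but the Java standard library, and it assumes 
the garbage shows up as control, private-use, unassigned, or replacement 
characters. Text that is wrongly mapped to ordinary letters would score 0 and 
would not be caught.

```java
// Hypothetical helper, not part of PDFBox or Tika.
final class GarbageScore {
    private GarbageScore() {}

    /** Returns the fraction (0.0 to 1.0) of code points that look like extraction debris. */
    static double suspiciousRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int suspicious = 0;
        // Iterate by code point so supplementary characters count once.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            if (isSuspicious(cp)) {
                suspicious++;
            }
        }
        return (double) suspicious / total;
    }

    /** Assumption: these character classes rarely occur in correctly extracted text. */
    static boolean isSuspicious(int cp) {
        // Ordinary line breaks and tabs are expected in extracted text.
        if (cp == '\n' || cp == '\r' || cp == '\t') {
            return false;
        }
        // U+FFFD is the Unicode replacement character.
        if (cp == 0xFFFD) {
            return true;
        }
        int type = Character.getType(cp);
        return type == Character.CONTROL
                || type == Character.PRIVATE_USE
                || type == Character.UNASSIGNED
                || type == Character.SURROGATE;
    }
}
```

Calling GarbageScore.suspiciousRatio on the text from each run gives two 
numbers that can be compared directly; any threshold for "garbage" would have 
to be chosen by looking at known-good documents.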

Thanks,
Dan

> only garbage extracted, lots of warnings "No Unicode mapping..."
> ----------------------------------------------------------------
>
>                 Key: PDFBOX-3438
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3438
>             Project: PDFBox
>          Issue Type: Wish
>          Components: Text extraction
>    Affects Versions: 2.0.2
>            Reporter: Oliver Steinau
>         Attachments: PDFBOX-3438.diff, PDFBOX-3438.txt, test.pdf
>
>
> When I try to extract text from this PDF, I get lots of warnings "No Unicode 
> mapping for ...", and as output I only get garbage.
> PDF file displays fine in Acrobat Reader, and pdftotext.exe will extract the 
> text just fine.
> PDF file seems to have a Type-1 font embedded with a custom encoding.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
