[jira] [Commented] (PDFBOX-3438) only garbage extracted, lots of warnings "No Unicode mapping..."

Oliver Steinau (JIRA) Wed, 27 Jul 2016 02:21:41 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15395294#comment-15395294
 ]


Oliver Steinau commented on PDFBOX-3438:
----------------------------------------

Thank you for your prompt reply! Unfortunately, I cannot build PDFBox from 
source, so I cannot use the patch.
Thinking about it, there is not much an extractor can do without a proper 
mapping. On the other hand, the file was created by Acrobat Distiller, which is 
a fairly common producer. It may be worth examining other files created by 
Distiller and adding your solution to PDFBox as an optional feature for those 
files (perhaps Distiller always omits the mappings but always creates names 
like these).
Anyway, I would downgrade this issue to a "New feature" or a "Wish" -- or 
should it be deleted altogether?
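
As an illustration of the "optional feature" idea in the comment above, here is a minimal, self-contained Java sketch of an opt-in fallback for glyph names that have no Unicode mapping. It is not the attached patch and does not use the PDFBox API. The class `GlyphNameFallback`, the method `toUnicode`, the flag `allowHeuristic` and the tiny `KNOWN` table are all hypothetical. The thread does not say what the names in test.pdf look like, so the "letters followed by a decimal character code" pattern is purely an assumption made for this example. Only the `uniXXXX` form is a real convention, taken from the Adobe Glyph List specification.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch only: resolves glyph names to Unicode, with an opt-in heuristic fallback. */
final class GlyphNameFallback {

    // Tiny stand-in table; a real extractor would consult the full Adobe Glyph List.
    private static final Map<String, String> KNOWN = new HashMap<>();
    static {
        KNOWN.put("A", "A");
        KNOWN.put("space", " ");
        KNOWN.put("adieresis", "\u00E4");
    }

    // Adobe Glyph List convention: "uni" followed by four uppercase hex digits.
    // (The specification also allows several four-digit groups; this sketch handles one.)
    private static final Pattern UNI = Pattern.compile("uni([0-9A-F]{4})");

    // HYPOTHETICAL pattern, assumed for illustration: letters plus a decimal code, e.g. "G65".
    private static final Pattern CODE_SUFFIX = Pattern.compile("[A-Za-z]+(\\d{1,5})");

    /**
     * Returns the Unicode string for a glyph name, or null if none can be determined.
     * The heuristic is applied only when allowHeuristic is true, because it can
     * misread legitimate names that merely happen to end in digits.
     */
    static String toUnicode(String glyphName, boolean allowHeuristic) {
        String known = KNOWN.get(glyphName);
        if (known != null) {
            return known;
        }
        Matcher uni = UNI.matcher(glyphName);
        if (uni.matches()) {
            int cp = Integer.parseInt(uni.group(1), 16);
            if (cp < 0xD800 || cp > 0xDFFF) { // surrogate values are not valid here
                return new String(Character.toChars(cp));
            }
            return null;
        }
        if (allowHeuristic) {
            Matcher m = CODE_SUFFIX.matcher(glyphName);
            if (m.matches()) {
                int code = Integer.parseInt(m.group(1));
                // Skip control codes, out-of-range values and surrogates.
                if (code >= 32 && code <= 0xFFFF && !Character.isSurrogate((char) code)) {
                    return String.valueOf((char) code);
                }
            }
        }
        return null;
    }
}
```

With the heuristic off, an unmapped name yields null, which corresponds to the "No Unicode mapping" case. Switching it on is a deliberate choice by the caller, matching the suggestion that such a fallback be optional rather than default behaviour.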

> only garbage extracted, lots of warnings "No Unicode mapping..."
> ----------------------------------------------------------------
>
>                 Key: PDFBOX-3438
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3438
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.2
>            Reporter: Oliver Steinau
>         Attachments: PDFBOX-3438.diff, PDFBOX-3438.txt, test.pdf
>
>
> When I try to extract text from this PDF, I get lots of warnings "No Unicode 
> mapping for ...", and as output I only get garbage.
> PDF file displays fine in Acrobat Reader, and pdftotext.exe will extract the 
> text just fine.
> PDF file seems to have a Type-1 font embedded with a custom encoding.



