[ 
https://issues.apache.org/jira/browse/PDFBOX-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237130#comment-17237130
 ] 

Maruan Sahyoun commented on PDFBOX-5023:
----------------------------------------

Hi Richard,

when a PDF is created and the font used is embedded/referenced, there should 
also be information generated for text extraction. That is, there should be 
information on which character (the char you'd like to get extracted) and 
which glyph (that's what you see on screen/gets printed) belong together. This 
is needed because typically a font is not embedded in full and only certain 
glyphs are taken (the ones which are used in the doc).

Now when it comes to text extraction, the ToUnicode information is used to get 
the char info for each glyph.

The messages you get are telling you that there are glyphs for which such 
information is not available. As a result you will see something on screen 
that doesn't get extracted.
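
To illustrate the lookup described above, here is a minimal sketch in plain 
Java. It does not use the PDFBox API: the class ToUnicodeSketch, the method 
extract and the sample glyph codes are hypothetical, and a simple Map stands in 
for a font's ToUnicode information. It only shows why a glyph without a mapping 
is visible on screen but contributes nothing to the extracted text.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ToUnicodeSketch {

    /**
     * Toy model of text extraction (an assumption for illustration, not PDFBox code).
     * Each glyph code shown on the page is looked up in the glyph -> character table.
     * Codes without an entry are collected in 'unmapped' and add nothing to the text.
     */
    static String extract(int[] glyphCodes, Map<Integer, String> toUnicode,
                          List<Integer> unmapped) {
        StringBuilder text = new StringBuilder();
        for (int code : glyphCodes) {
            String unicode = toUnicode.get(code);
            if (unicode == null) {
                // The glyph is still drawn, but there is no character info for it.
                unmapped.add(code);
            } else {
                text.append(unicode);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Hypothetical mapping: glyph 3 is used in the document but has no entry.
        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(1, "P");
        toUnicode.put(2, "D");

        int[] glyphsOnPage = {1, 2, 3};
        List<Integer> unmapped = new ArrayList<>();

        String text = extract(glyphsOnPage, toUnicode, unmapped);
        System.out.println("extracted text: " + text);
        System.out.println("glyph codes without mapping: " + unmapped);
    }
}
```

In this sketch three glyphs are shown but only two characters are extracted. 
That is the situation the log messages are pointing at.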

This is an issue with the application generating the PDF. From that 
perspective the PDF is incomplete, i.e. not suitable to be fully extracted to 
text.

To be clear, this is not an error in PDFBox; the messages in the logs are 
only there to inform you of that fact.

> OpenType Layout tables used in font ArabicTransparent-ARABIC are not 
> implemented in PDFBox and will be ignored
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5023
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5023
>             Project: PDFBox
>          Issue Type: Wish
>          Components: FontBox, Text extraction
>    Affects Versions: 2.0.8
>            Reporter: Richard Azar
>            Priority: Major
>              Labels: fop-teaming
>         Attachments: ExtractText.txt, log PDFbox.txt, pdfsample.pdf, sc1.PNG, 
> sc2.PNG, sc3.PNG
>
>
> I am loading a PDF document with TrueType and TrueType CID fonts (both within 
> the same document) and only the TrueType font texts are extracted using 
> tStripper.getText.
> I am getting the below error in the logs (full logs attached):
> OpenType Layout tables used in font ArabicTransparent-ARABIC are not 
> implemented in PDFBox and will be ignored.



