[
https://issues.apache.org/jira/browse/PDFBOX-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237130#comment-17237130
]
Maruan Sahyoun commented on PDFBOX-5023:
----------------------------------------
Hi Richard,
when a PDF is created and the font used is embedded or referenced, information
for text extraction should also be generated. That is, there should be a
mapping that says which character (the char you'd like to get extracted) and
which glyph (what you see on screen or in print) belong together. This is
needed because a font is typically not embedded in full; only certain glyphs
(the ones used in the document) are included.
When it comes to text extraction, the ToUnicode information is used to get
the character for each glyph.
The messages you get are telling you that there are glyphs for which such
information is not available. As a result you will see something on screen
that does not get extracted.
This is an issue with the application that generated the PDF: from that
perspective the PDF is incomplete, i.e. not suitable for full text
extraction.
To be clear, this is not an error in PDFBox; the messages in the logs are
only there to inform you of that fact.
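
To illustrate the point above, here is a minimal sketch of why a glyph can be
visible but not extractable. It is a toy model, not PDFBox code: the class
ToUnicodeSketch, the method extract and the Result type are hypothetical names
made up for this example, and the plain Map stands in for a font's ToUnicode
information. Character codes that have no entry in the map are skipped in the
extracted text and collected so they can be reported, which is roughly the
situation the log messages describe.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of code-to-Unicode lookup; this is NOT PDFBox's implementation.
class ToUnicodeSketch {

    /** Extracted text plus the character codes that had no Unicode mapping. */
    static final class Result {
        final String text;
        final List<Integer> unmapped;

        Result(String text, List<Integer> unmapped) {
            this.text = text;
            this.unmapped = unmapped;
        }
    }

    /**
     * Walks the character codes of a text run. A code with a mapping contributes
     * its Unicode string; a code without one contributes nothing and is recorded.
     */
    static Result extract(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder text = new StringBuilder();
        List<Integer> unmapped = new ArrayList<>();
        for (int code : codes) {
            String unicode = toUnicode.get(code);
            if (unicode == null) {
                // The glyph would still be drawn on screen, but there is nothing to extract.
                unmapped.add(code);
            } else {
                text.append(unicode);
            }
        }
        return new Result(text.toString(), unmapped);
    }

    public static void main(String[] args) {
        // Assumed example data: codes 1 and 2 are mapped, code 3 is not.
        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(1, "H");
        toUnicode.put(2, "i");

        Result result = extract(new int[] {1, 2, 3}, toUnicode);
        System.out.println("extracted: " + result.text);
        System.out.println("codes without mapping: " + result.unmapped);
    }
}
```

In this sketch the third glyph would be rendered like the others, yet only the
first two end up in the extracted text. A real extractor may try further
fallbacks before giving up, so treat this purely as an illustration.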
> OpenType Layout tables used in font ArabicTransparent-ARABIC are not
> implemented in PDFBox and will be ignored
> --------------------------------------------------------------------------------------------------------------
>
> Key: PDFBOX-5023
> URL: https://issues.apache.org/jira/browse/PDFBOX-5023
> Project: PDFBox
> Issue Type: Wish
> Components: FontBox, Text extraction
> Affects Versions: 2.0.8
> Reporter: Richard Azar
> Priority: Major
> Labels: fop-teaming
> Attachments: ExtractText.txt, log PDFbox.txt, pdfsample.pdf, sc1.PNG,
> sc2.PNG, sc3.PNG
>
>
> I am loading a PDF document with TrueType and TrueType CID fonts (both within
> the same document), and only the TrueType font text is extracted using
> tStripper.getText.
> I am getting the error below in the logs (full logs attached):
> OpenType Layout tables used in font ArabicTransparent-ARABIC are not
> implemented in PDFBox and will be ignored.