[ 
https://issues.apache.org/jira/browse/PDFBOX-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17236982#comment-17236982
 ] 

Michael Klink commented on PDFBOX-5023:
---------------------------------------

Richard,

are you aware that quite a lot of the _text_ in your PDF is actually just a 
bitmap image and, therefore, not subject to text extraction?

Adobe Reader does indeed extract a bit more during copy&paste (Ctrl-A, Ctrl-C) 
than PDFBox. That has nothing to do with PDFBox not diving into the font 
programs, though; it is because the text stripper does not interpret 
*ActualText* tagging properties.
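
To illustrate the difference, here is a minimal, self-contained Java sketch. 
It does not use the PDFBox API: the classes Span and ActualTextSketch and both 
method names are hypothetical, and a marked-content sequence is reduced to a 
pair of strings. It only shows why an extractor that honors the ActualText 
property can return more text than one that ignores it.

```java
import java.util.List;

/** Hypothetical model of one marked-content sequence; not a PDFBox type. */
final class Span {
    final String glyphText;   // text obtained by decoding the shown glyphs (may be empty)
    final String actualText;  // value of the ActualText property, or null if absent

    Span(String glyphText, String actualText) {
        this.glyphText = glyphText;
        this.actualText = actualText;
    }
}

/** Hypothetical extractor sketch contrasting the two behaviours. */
final class ActualTextSketch {

    /** Concatenates only the decoded glyph text; ActualText is never consulted. */
    static String extractIgnoringActualText(List<Span> spans) {
        StringBuilder out = new StringBuilder();
        for (Span span : spans) {
            out.append(span.glyphText);
        }
        return out.toString();
    }

    /** Uses ActualText as replacement text whenever a span carries it. */
    static String extractHonoringActualText(List<Span> spans) {
        StringBuilder out = new StringBuilder();
        for (Span span : spans) {
            out.append(span.actualText != null ? span.actualText : span.glyphText);
        }
        return out.toString();
    }
}
```

For two spans, ("Hello ", null) followed by ("", "world"), the first method 
returns "Hello " and the second returns "Hello world".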



> OpenType Layout tables used in font ArabicTransparent-ARABIC are not 
> implemented in PDFBox and will be ignored
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5023
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5023
>             Project: PDFBox
>          Issue Type: Wish
>          Components: FontBox, Text extraction
>    Affects Versions: 2.0.8
>            Reporter: Richard Azar
>            Priority: Major
>              Labels: fop-teaming
>         Attachments: ExtractText.txt, log PDFbox.txt, pdfsample.pdf, sc1.PNG, 
> sc2.PNG
>
>
> I am loading a PDF document with TrueType and TrueType CID Fonts (both within 
> same document) and Only TrueType font texts are extracted using 
> tStripper.getText.
> Getting the below error in logs (full logs attached)
> OpenType Layout tables used in font ArabicTransparent-ARABIC are not 
> implemented in PDFBox and will be ignored.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

