[ 
https://issues.apache.org/jira/browse/PDFBOX-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237333#comment-17237333
 ] 

Michael Klink commented on PDFBOX-5023:
---------------------------------------

[~richardazar],
{quote}however my concern is that the Arabic text is not getting extracted due 
to the errors shared in the logs [OpenType Layout tables used in font 
ArabicTransparent-ARABIC are not implemented in PDFBox and will be ignored]
{quote}
This is incorrect. You are assuming a causal link between two things that are 
not causally related.

PDF text extraction (as described by the PDF specification) does not look into 
font programs. Thus, it would make no difference if PDFBox did support those 
layout tables.

As mentioned in a previous comment, the reason why e.g. Adobe Reader copy&paste 
extracts a bit more text is that it also uses tagging properties (*ActualText* 
properties) during text extraction, whereas the standard PDFBox text stripper 
does not use these properties.
{quote}Kindly confirm that there is no fix for this issue. we are open for any 
alternatives.
{quote}
As the causality central to your issue description does not exist, it does not 
make sense to confirm something here.

What you can do is enhance the PDFBox text stripper to also take 
*ActualText* properties into account.
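
To illustrate the idea only: the following is a rough, library-independent 
sketch of what "taking *ActualText* into account" means during extraction. It 
does not use any PDFBox API. The class ActualTextSketch, its Event type and the 
extract method are hypothetical names made up for this illustration, and it 
assumes that some content stream processor reports the begin and end of marked 
content sequences together with the decoded text shown in between. When a 
marked content sequence carries an ActualText entry, that string replaces 
whatever text is shown inside the sequence.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Library-independent model; every name here is hypothetical, none is PDFBox API.
class ActualTextSketch {

    enum Kind { BEGIN, TEXT, END }

    /** One event as a content stream processor might report it (assumption). */
    static final class Event {
        final Kind kind;
        final String value; // BEGIN: ActualText or null; TEXT: decoded text; END: null

        private Event(Kind kind, String value) {
            this.kind = kind;
            this.value = value;
        }

        static Event begin(String actualText) { return new Event(Kind.BEGIN, actualText); }
        static Event text(String decoded) { return new Event(Kind.TEXT, decoded); }
        static Event end() { return new Event(Kind.END, null); }
    }

    /** Builds the extracted text; ActualText replaces the text shown inside its sequence. */
    static String extract(List<Event> events, boolean useActualText) {
        StringBuilder out = new StringBuilder();
        Deque<Boolean> open = new ArrayDeque<>(); // one flag per open marked content sequence
        int replacing = 0;                        // open sequences that carry ActualText
        for (Event e : events) {
            if (e.kind == Kind.BEGIN) {
                boolean hasActualText = useActualText && e.value != null;
                if (hasActualText && replacing == 0) {
                    out.append(e.value); // emit the replacement text once, at the outermost level
                }
                if (hasActualText) {
                    replacing++;
                }
                open.push(hasActualText);
            } else if (e.kind == Kind.TEXT) {
                if (replacing == 0) {
                    out.append(e.value); // ordinary text outside any ActualText sequence
                }
            } else if (!open.isEmpty() && open.pop()) {
                replacing--; // leaving a sequence that carried ActualText
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                Event.text("Invoice "),
                Event.begin("\u0645\u0631\u062D\u0628\u0627"), // ActualText with the real Arabic string
                Event.text("??"),                              // what glyph decoding alone yields
                Event.end(),
                Event.text(" end"));
        System.out.println(extract(events, false)); // ActualText ignored
        System.out.println(extract(events, true));  // ActualText honoured
    }
}
```

With useActualText set to false the sketch returns only the text decoded from 
the shown glyphs; with true it returns the ActualText string in place of that 
span. How to obtain such events inside a text stripper subclass depends on the 
PDFBox version and is not shown here.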

> OpenType Layout tables used in font ArabicTransparent-ARABIC are not 
> implemented in PDFBox and will be ignored
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5023
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5023
>             Project: PDFBox
>          Issue Type: Wish
>          Components: FontBox, Text extraction
>    Affects Versions: 2.0.8
>            Reporter: Richard Azar
>            Priority: Major
>              Labels: fop-teaming
>         Attachments: ExtractText.txt, log PDFbox.txt, pdfsample.pdf, sc1.PNG, 
> sc2.PNG, sc3.PNG
>
>
> I am loading a PDF document with TrueType and TrueType CID Fonts (both within 
> same document) and only TrueType font texts are extracted using 
> tStripper.getText.
> Getting the below error in logs (full logs attached)
> OpenType Layout tables used in font ArabicTransparent-ARABIC are not 
> implemented in PDFBox and will be ignored.


