[
https://issues.apache.org/jira/browse/PDFBOX-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237333#comment-17237333
]
Michael Klink commented on PDFBOX-5023:
---------------------------------------
[~richardazar],
{quote}however my concern is that the Arabic text is not getting extracted due
to the errors shared in the logs [OpenType Layout tables used in font
ArabicTransparent-ARABIC are not implemented in PDFBox and will be ignored]
{quote}
This is incorrect. You claim a causal link between two observations which are
not causally related.
PDF text extraction (as described by the PDF specification) does not look into
font programs. Thus, it would make no difference if PDFBox did support those
layout tables.
As mentioned in a previous comment, the reason why e.g. Adobe Reader copy&paste
extracts a bit more text is that there are tagging properties (*ActualText*
properties) which can also be used during text extraction, but the standard
PDFBox text stripper does not use these properties.
{quote}Kindly confirm that there is no fix for this issue. we are open for any
alternatives.
{quote}
As the causality central to your issue description does not exist, it does not
make sense to confirm anything here.
What you can go for is enhancing the PDFBox text stripper to also take
*ActualText* properties into account.
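To illustrate the idea only, here is a minimal sketch in plain Java. It does not use any PDFBox classes; the {{Event}} and {{Kind}} types and the {{extract}} method are hypothetical names invented for this example. It assumes the content stream has already been reduced to a flat list of events (begin marked content, show text, end marked content) and that a begin event may carry an *ActualText* string. Inside a marked-content sequence that has *ActualText*, the glyph text is dropped and the *ActualText* value is emitted once instead.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical model, not PDFBox API: shows how ActualText could replace glyph text.
final class ActualTextSketch {

    enum Kind { BEGIN_MARKED_CONTENT, SHOW_TEXT, END_MARKED_CONTENT }

    static final class Event {
        final Kind kind;
        // For BEGIN_MARKED_CONTENT: the ActualText value, or null if absent.
        // For SHOW_TEXT: the text decoded from the glyph codes.
        // For END_MARKED_CONTENT: unused.
        final String text;

        Event(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    static String extract(List<Event> events) {
        StringBuilder out = new StringBuilder();
        // Remembers, per open marked-content sequence, whether it carried ActualText.
        Deque<Boolean> open = new ArrayDeque<>();
        // Number of currently open sequences that carry ActualText.
        int replacing = 0;

        for (Event e : events) {
            switch (e.kind) {
                case BEGIN_MARKED_CONTENT:
                    boolean hasActualText = e.text != null;
                    if (hasActualText) {
                        // Only the outermost ActualText is emitted.
                        if (replacing == 0) {
                            out.append(e.text);
                        }
                        replacing++;
                    }
                    open.push(hasActualText);
                    break;
                case SHOW_TEXT:
                    // Glyph text is kept only outside ActualText sequences.
                    if (replacing == 0) {
                        out.append(e.text);
                    }
                    break;
                case END_MARKED_CONTENT:
                    if (!open.isEmpty() && open.pop()) {
                        replacing--;
                    }
                    break;
            }
        }
        return out.toString();
    }
}
```

In an actual enhancement the events would have to come from the marked-content operators and their property dictionaries in the page content stream; how to hook that into the text stripper depends on the PDFBox version and is not shown here.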
> OpenType Layout tables used in font ArabicTransparent-ARABIC are not
> implemented in PDFBox and will be ignored
> --------------------------------------------------------------------------------------------------------------
>
> Key: PDFBOX-5023
> URL: https://issues.apache.org/jira/browse/PDFBOX-5023
> Project: PDFBox
> Issue Type: Wish
> Components: FontBox, Text extraction
> Affects Versions: 2.0.8
> Reporter: Richard Azar
> Priority: Major
> Labels: fop-teaming
> Attachments: ExtractText.txt, log PDFbox.txt, pdfsample.pdf, sc1.PNG,
> sc2.PNG, sc3.PNG
>
>
> I am loading a PDF document with TrueType and TrueType CID Fonts (both within
> same document) and only TrueType font texts are extracted using
> tStripper.getText.
> Getting the below error in logs (full logs attached)
> OpenType Layout tables used in font ArabicTransparent-ARABIC are not
> implemented in PDFBox and will be ignored.