[ 
https://issues.apache.org/jira/browse/PDFBOX-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237472#comment-17237472
 ] 

Tilman Hausherr edited comment on PDFBOX-5023 at 11/23/20, 4:33 PM:
--------------------------------------------------------------------

This is in the content stream. It's a marked-content command that tells a text 
extractor to replace the extracted text with something else.
{code}
BT
  131.129 718.38 Td
  /Span << /ActualText (\376\377\000.\000,\006G\006'\006F\006J\000 ) >> BDC
    (\000\003\001\223\001\212\001U\001\215\000\017\000\021) Tj
  EMC
ET
{code}
Here the glyphs for "\000\003\001\223\001\212\001U\001\215\000\017\000\021" are 
displayed on the screen, but text extraction would have to use the text for 
"\376\377\000.\000,\006G\006'\006F\006J\000" instead.
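To make the replacement text readable, here is a minimal sketch of how the /ActualText string above could be decoded. It is not PDFBox code: the class name ActualTextDecoder and its methods are hypothetical, only octal escapes and escaped single characters are handled, and ISO-8859-1 is used as a rough stand-in for PDFDocEncoding when no byte order mark is present.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of PDFBox.
class ActualTextDecoder {

    // Convert the body of a PDF literal string (without the enclosing
    // parentheses) into raw bytes. Simplification: only octal escapes (\ddd)
    // and escaped single characters such as \( \) \\ are handled; named
    // escapes like \n are not translated.
    static byte[] unescape(String literal) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < literal.length()) {
            char c = literal.charAt(i++);
            if (c != '\\') {
                out.write(c);
                continue;
            }
            if (i >= literal.length()) {
                break; // trailing backslash, nothing to escape
            }
            int value = 0;
            int digits = 0;
            // An octal escape has one to three digits.
            while (digits < 3 && i < literal.length()
                    && literal.charAt(i) >= '0' && literal.charAt(i) <= '7') {
                value = value * 8 + (literal.charAt(i++) - '0');
                digits++;
            }
            if (digits > 0) {
                out.write(value & 0xFF);
            } else {
                out.write(literal.charAt(i++));
            }
        }
        return out.toByteArray();
    }

    // Decode a text string: a leading FE FF byte order mark means UTF-16BE.
    // Assumption: anything else is treated as ISO-8859-1 here, which only
    // approximates PDFDocEncoding.
    static String decode(String literal) {
        byte[] bytes = unescape(literal);
        if (bytes.length >= 2
                && (bytes[0] & 0xFF) == 0xFE && (bytes[1] & 0xFF) == 0xFF) {
            return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
        }
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String actualText =
                decode("\\376\\377\\000.\\000,\\006G\\006'\\006F\\006J\\000 ");
        // Print code points so the result does not depend on console encoding.
        actualText.codePoints()
                .forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

The \376\377 prefix is the UTF-16BE byte order mark, so each following byte pair is one character: a full stop, a comma, four Arabic letters (U+0647, U+0627, U+0646, U+064A) and a space. The string in the Tj operator, by contrast, holds glyph codes of the embedded font and cannot be decoded this way.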

It isn't implemented, either because it is difficult or because people think it 
is difficult.


> OpenType Layout tables used in font ArabicTransparent-ARABIC are not 
> implemented in PDFBox and will be ignored
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5023
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5023
>             Project: PDFBox
>          Issue Type: Wish
>          Components: FontBox, Text extraction
>    Affects Versions: 2.0.8
>            Reporter: Richard Azar
>            Priority: Major
>              Labels: fop-teaming
>         Attachments: ExtractText.txt, log PDFbox.txt, pdfsample.pdf, sc1.PNG, 
> sc2.PNG, sc3.PNG
>
>
> I am loading a PDF document with TrueType and TrueType CID fonts (both within 
> the same document), and only the TrueType font text is extracted using 
> tStripper.getText.
> I am getting the error below in the logs (full logs attached):
> OpenType Layout tables used in font ArabicTransparent-ARABIC are not 
> implemented in PDFBox and will be ignored.


