[ 
https://issues.apache.org/jira/browse/PDFBOX-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031684#comment-14031684
 ] 

John Hewson commented on PDFBOX-1919:
-------------------------------------

I'm getting the same result too. So it does indeed seem that copy & paste is 
retrieving the "accessible" text (i.e. the span tags) whereas "save as text" is 
retrieving the Unicode values.

This raises the question of whether we should care about the span tags at all. 
I doubt that we should extract the text "greater than or equal to" just because 
the span tag for "≥" happens to contain that to appease legacy screen readers. 
That's not going to be the result that most users expect! In the case of this 
PDF, the span tags are as bad as the Unicode text; in fact, I'd say they're 
worse.

Suggestion: continue to ignore span tags.

> Font descriptor flags are not implemented
> -----------------------------------------
>
>                 Key: PDFBOX-1919
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1919
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.5, 1.8.6, 2.0.0
>            Reporter: Corentin Regal
>         Attachments: PDFBOX-1919.AdobeReader.txt, PDFBOX-1919.pdf, 
> PDFBOX-1919.txt
>
>
> The font descriptor flags are not set.
> They are described in the document "PDF reference 1.7", section 5.7.1 "Font 
> Descriptor Flags".
> The methods in PDFontDescriptor are ready but never called:
> setFlags()
> setSerif()
> setAllCap(), which is used in a lot of PDFs
> ...
> I saw some TODOs in the code that relate to this issue. Is it planned to be 
> implemented soon?
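
For context on the flags the issue description refers to: the /Flags entry of a 
font descriptor is a 32-bit integer in which each flag occupies one bit, numbered 
from 1 at the low-order end (PDF reference 1.7, section 5.7.1). The following is 
only a minimal sketch of that bit layout. The class FontDescriptorFlagsSketch and 
its methods isSet() and with() are hypothetical names made up for illustration; 
they are not PDFBox API and do not show how PDFontDescriptor is implemented.

```java
/** Hypothetical helper (not part of PDFBox): bit masks for the /Flags entry. */
final class FontDescriptorFlagsSketch {
    // The PDF reference numbers bits from 1, so bit n corresponds to 1 << (n - 1).
    static final int FIXED_PITCH  = 1 << 0;   // bit 1
    static final int SERIF        = 1 << 1;   // bit 2
    static final int SYMBOLIC     = 1 << 2;   // bit 3
    static final int SCRIPT       = 1 << 3;   // bit 4
    static final int NON_SYMBOLIC = 1 << 5;   // bit 6
    static final int ITALIC       = 1 << 6;   // bit 7
    static final int ALL_CAP      = 1 << 16;  // bit 17
    static final int SMALL_CAP    = 1 << 17;  // bit 18
    static final int FORCE_BOLD   = 1 << 18;  // bit 19

    private FontDescriptorFlagsSketch() {
    }

    /** Returns true if the given flag bit is set in the /Flags value. */
    static boolean isSet(int flags, int mask) {
        return (flags & mask) != 0;
    }

    /** Returns a copy of the /Flags value with the given flag bit set or cleared. */
    static int with(int flags, int mask, boolean on) {
        return on ? (flags | mask) : (flags & ~mask);
    }
}
```

Under these assumptions, a serif all-caps font would carry the value 
with(with(0, SERIF, true), ALL_CAP, true), and a reader would test it with 
isSet(flags, ALL_CAP).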



--
This message was sent by Atlassian JIRA
(v6.2#6252)
