[ 
https://issues.apache.org/jira/browse/PDFBOX-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984491#comment-14984491
 ] 

John Hewson edited comment on PDFBOX-3066 at 11/1/15 7:46 PM:
--------------------------------------------------------------

OS X Preview and Foxit extract the text as ")*+,-./012)456" but Acrobat 
extracts the correct text. The encoding in this CFF font is definitely corrupt. 
Acrobat is doing some black magic to correct things, but they must know 
something that we don't, because I can't see any telling information about how 
to detect and fix the problem.

Some interesting observations:

- the Font dictionary in the PDF has no Encoding or Flags entry. Note that 
Flags are required and affect encoding.
- the CFF font does not specify a Charset, so the default is used. This 
behaviour is described in the CFF spec, so it's normal, but still worth noting.
- the CFF font contains a valid format 0 encoding, but it doesn't match what we 
expect.
- the issue with the encoding isn't a simple constant-offset problem, e.g. 
adding 7 to the SID yields "01234567890;<=", which is still incorrect. 
Rendering is perfect, so this isn't an encoding or charset bug in PDFBox; it 
only affects text extraction.
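
The constant-offset check from the last point can be reproduced with a few lines of Java. This is only a rough sketch: it shifts the character codes of the extracted string as a stand-in for shifting the SIDs, and the class and method names (OffsetCheck, shift) are made up for illustration, not part of PDFBox.

```java
// Hypothetical helper, not part of PDFBox: applies a constant offset to every
// character code of the extracted text (an approximation of shifting the SIDs).
final class OffsetCheck {

    // Returns a copy of 'garbled' with 'offset' added to each character code.
    static String shift(String garbled, int offset) {
        StringBuilder sb = new StringBuilder(garbled.length());
        for (int i = 0; i < garbled.length(); i++) {
            sb.append((char) (garbled.charAt(i) + offset));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Text as extracted by OS X Preview and Foxit.
        String garbled = ")*+,-./012)456";
        // Prints 01234567890;<= which matches the result quoted above.
        System.out.println(shift(garbled, 7));
    }
}
```

The shifted string starts out looking like plausible digits but then repeats "0" and runs into punctuation, which is why a single offset can't be the whole story.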

I don't see how we can detect that such encodings are invalid without raising 
false positives. Adobe know something we don't. If we do find a fix, it can't 
go in the Encoding or CFF layers, because they are already providing the 
correct Encoding to rendering. We would have to add some extra layer to 
"correct" the extracted text during the text extraction process itself, 
perhaps in PDFTextStreamEngine.


was (Author: jahewson):
OS X Preview and Foxit extract the text as ")*+,-./012)456" but Acrobat 
extracts the correct text.

> Text getting garbled in this file, was Ok in 1.8
> ------------------------------------------------
>
>                 Key: PDFBOX-3066
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3066
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Joel Hirsh
>            Assignee: John Hewson
>             Fix For: 2.0.0
>
>         Attachments: PDFBOX-3066-reduced.pdf, garbled.pdf
>
>
> Attached file, PrintTextLocations shows text garbled, like *,%-))’)) 
> Acrobat copy/paste shows accurate text, and was also fine in 1.8.


