[jira] [Comment Edited] (PDFBOX-3043) Character is extracted twice

John Hewson (JIRA) Wed, 21 Oct 2015 18:04:25 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968291#comment-14968291
 ]


John Hewson edited comment on PDFBOX-3043 at 10/22/15 1:02 AM:
---------------------------------------------------------------

Yep, that's a LaTeX thing. The characters in the content stream are actually 
{code}c{code} and {code}  ⃝{code}  with the latter written on top of the 
former. That's just how TeX handles diacritics because its fonts predate 
Unicode, so it layers characters. Of course © is a single character nowadays, 
but TeX treats it as a combination of those two characters.

The encoding of the circle is a known mapping for the "circlecopyrt" character 
and is built into PDFBox's [additional glyphlist|https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/resources/org/apache/pdfbox/resources/glyphlist/additional.txt];
 perhaps we should be mapping it to a combining circle instead. Note that the 
additional list is not standard in any way, it's just a collection of common 
mappings which we've encountered over the years and ship. It's only really used 
by PDFTextStripper, though the code which loads it can be found in 
PDFTextStreamEngine.
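
To make the difference between the two mappings concrete, here is a minimal sketch using only the JDK, not PDFBox. It assumes "combining circle" means U+20DD COMBINING ENCLOSING CIRCLE; the class and the {{extracted}} helper are hypothetical stand-ins for what text extraction would emit, not PDFBox API.

```java
import java.text.Normalizer;

public class CircleCopyrtSketch {
    // U+00A9 COPYRIGHT SIGN: what "circlecopyrt" maps to today, per the comment above.
    static final int COPYRIGHT_SIGN = 0x00A9;
    // U+20DD COMBINING ENCLOSING CIRCLE: assumed candidate for the suggested remapping.
    static final int COMBINING_CIRCLE = 0x20DD;

    // Hypothetical stand-in for extraction: the letter TeX draws, followed by
    // whatever code point the circle glyph is mapped to.
    static String extracted(int circleCodePoint) {
        return "c" + new String(Character.toChars(circleCodePoint));
    }

    // True if the code point is a mark that encloses the preceding character.
    static boolean isEnclosingMark(int codePoint) {
        return Character.getType(codePoint) == Character.ENCLOSING_MARK;
    }

    public static void main(String[] args) {
        String current = extracted(COPYRIGHT_SIGN);    // letter + standalone symbol
        String remapped = extracted(COMBINING_CIRCLE); // letter + enclosing mark

        System.out.println(current + " enclosing mark? " + isEnclosingMark(COPYRIGHT_SIGN));
        System.out.println(remapped + " enclosing mark? " + isEnclosingMark(COMBINING_CIRCLE));

        // Check whether NFC normalization changes the remapped sequence at all.
        String nfc = Normalizer.normalize(remapped, Normalizer.Form.NFC);
        System.out.println("unchanged by NFC? " + nfc.equals(remapped));
    }
}
```

With the current mapping the result is two independent characters, which is the "c©" reported in the issue. With the assumed remapping the result would be a base letter followed by a mark that attaches to it, which is closer to what TeX actually drew, although it is still two code points rather than a single ©.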


was (Author: jahewson):
Yep, that's a LaTeX thing. The characters in the content stream are actually 
{code}c{code} and {code}  ⃝{code}  with the latter written on top of the 
former. That's just how TeX handles diacritics because its fonts predate 
Unicode, so it layers characters.

The encoding of the circle is a known mapping for the "circlecopyrt" character 
and is built into PDFBox's [additional glyphlist|https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/resources/org/apache/pdfbox/resources/glyphlist/additional.txt];
 perhaps we should be mapping it to a combining circle instead. Note that the 
additional list is not standard in any way, it's just a collection of common 
mappings which we've encountered over the years and ship. It's only really used 
by PDFTextStripper, though the code which loads it can be found in 
PDFTextStreamEngine.

> Character is extracted twice
> ----------------------------
>
>                 Key: PDFBOX-3043
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3043
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Ben McCann
>         Attachments: cweb2.pdf
>
>
> This document has a © symbol. It's being extracted as "c©". I wanted to check 
> if this is a bug.
> One of the things that's strange about this is that PDFTextStripper first 
> processes "c" and then processes "©". However, PrintTextLocations prints them 
> in the other order:
> String[214.936,618.879 fs=9.963 xscale=9.963 height=8.642903 space=9.963 
> width=9.962997]©
> String[217.704,618.579 fs=9.963 xscale=9.963 height=6.072449 space=5.537458 
> width=4.4235687]c
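
The coordinates quoted in the description can be checked with a little arithmetic. The sketch below is plain Java with a hypothetical {{Box}} holder; only the x and width values come from the PrintTextLocations output above. It tests whether the "c" lies inside the horizontal extent of the circle glyph, as it would if the two were drawn on top of each other.

```java
public class GlyphOverlapSketch {
    // Hypothetical holder for the x and width fields of one PrintTextLocations line.
    static final class Box {
        final double x;
        final double width;

        Box(double x, double width) {
            this.x = x;
            this.width = width;
        }

        double right() {
            return x + width;
        }
    }

    // True if inner lies entirely within outer's horizontal extent.
    static boolean containsHorizontally(Box outer, Box inner) {
        return outer.x <= inner.x && inner.right() <= outer.right();
    }

    public static void main(String[] args) {
        Box circle = new Box(214.936, 9.962997);  // values from the line ending in the circle glyph
        Box letter = new Box(217.704, 4.4235687); // values from the line ending in "c"

        System.out.println("c inside circle: " + containsHorizontally(circle, letter));
        System.out.println("circle starts further left: " + (circle.x < letter.x));
    }
}
```

The circle starts further left than the letter, which is at least consistent with PrintTextLocations listing it first, though the thread does not say how that tool orders its output.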


