[ 
https://issues.apache.org/jira/browse/PDFBOX-5747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17921207#comment-17921207
 ] 

Richard Eckart de Castilho commented on PDFBOX-5747:
----------------------------------------------------

> This looks like "X^" in PDFDebugger or when opening the .txt file in Firefox.

I didn't manage to get the PDFDebugger to open on my system... neither the 2.x 
version nor the 3.x version. Simply no window opened. No idea.

In the Eclipse console, the 𝑋̂ printed correctly.

Most importantly, though, the XML libs that I pipe the whole stuff through 
didn't choke anymore because of the broken surrogates.
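
For anyone who wants to check extracted text before handing it to an XML or JSON library, here is a minimal sketch of a well-formedness test for UTF-16 strings. The class and method names (`SurrogateCheck`, `firstUnpairedSurrogate`) are made up for illustration and are not part of PDFBox; only `java.lang.Character` methods are used.

```java
final class SurrogateCheck {

    /**
     * Returns the index of the first unpaired surrogate code unit in s,
     * or -1 if every high surrogate is directly followed by a low surrogate.
     * Hypothetical helper, not a PDFBox API.
     */
    public static int firstUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // valid pair: skip its low half
                } else {
                    return i; // high surrogate without a following low surrogate
                }
            } else if (Character.isLowSurrogate(c)) {
                return i; // low surrogate without a preceding high surrogate
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Sequence reported in the issue: diacritic splits the pair.
        System.out.println(firstUnpairedSurrogate("\uD835\u0302\uDC4B"));
        // Expected sequence: pair first, then the diacritic.
        System.out.println(firstUnpairedSurrogate("\uD835\uDC4B\u0302"));
    }
}
```

A non-negative result means the string contains a broken surrogate and is likely to be rejected by strict parsers.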

If you have any better idea with respect to representing the compound... 🤷 

> Surrogate pairs with combining diacritics are incorrectly ordered on text 
> extraction
> ------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5747
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5747
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.30, 2.0.33, 3.0.4 PDFBox
>            Reporter: P Crossa
>            Assignee: Tilman Hausherr
>            Priority: Major
>             Fix For: 2.0.34, 3.0.5 PDFBox, 4.0.0
>
>         Attachments: 
> PDFBOX-5747-unicode-surrogate-with-diacritic-reduced.pdf, PDFBOX-5747.patch, 
> invchar.pdf
>
>
> When extending {{PDFTextStripper}}, the {{writeString}} override receives a 
> {{List<TextPosition>}}. When iterating over them, the {{getUnicode()}} call 
> should return the Unicode representation of the extracted text.
> However, for glyphs that require a surrogate pair (such as some mathematical 
> symbols, e.g. 𝑋) that are modified with a combining diacritic (such as ^), 
> the extracted Unicode characters are out of order.
> The attached PDF contains 𝑋̂. This is composed of 𝑋, which is represented as 
> the surrogate pair \uD835\uDC4B, and the combining diacritic \u0302.
> However, when extracted, we get \uD835\u0302\uDC4B (the combining diacritic 
> is placed between the two code units of the surrogate pair). This is an 
> invalid representation, and when encoded as JSON it will break most parsers. 
> The expected output would be \uD835\uDC4B\u0302.
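
As an illustration of the reordering described in the issue, the following sketch moves combining marks that ended up between a high and a low surrogate to the position after the completed pair. It is a caller-side workaround under stated assumptions, not necessarily how the attached patch addresses the problem; `SurrogateReorder`, `repair`, and `isCombiningMark` are hypothetical names, and "combining mark" is assumed here to mean Unicode general categories Mn, Mc, and Me.

```java
final class SurrogateReorder {

    /** True for code units in the Unicode mark categories (Mn, Mc, Me). */
    public static boolean isCombiningMark(char c) {
        int type = Character.getType(c);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    /**
     * Rewrites "high, marks..., low" as "high, low, marks...".
     * Everything else is copied through unchanged.
     * Hypothetical helper, not a PDFBox API.
     */
    public static String repair(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // Scan over any combining marks that follow the high surrogate.
                int j = i + 1;
                while (j < s.length() && isCombiningMark(s.charAt(j))) {
                    j++;
                }
                boolean marksInside = j > i + 1;
                if (marksInside && j < s.length() && Character.isLowSurrogate(s.charAt(j))) {
                    out.append(c).append(s.charAt(j)); // rejoin the surrogate pair
                    out.append(s, i + 1, j);           // then the displaced marks
                    i = j + 1;
                    continue;
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String broken = "\uD835\u0302\uDC4B";
        String fixed = repair(broken);
        System.out.println(fixed.equals("\uD835\uDC4B\u0302"));
    }
}
```

Strings that are already well-formed pass through unchanged, so the helper could be applied to every value returned by {{getUnicode()}} on affected versions.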


