[ https://issues.apache.org/jira/browse/PDFBOX-5747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17921145#comment-17921145 ]
Tilman Hausherr commented on PDFBOX-5747:
-----------------------------------------

I don't know why this was ignored, I usually have time during Christmas. I'm gonna make an attempt to run this later today.

> Surrogate pairs with combining diacritics are incorrectly ordered on text extraction
> ------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5747
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5747
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.30
>            Reporter: P Crossa
>            Priority: Major
>         Attachments: PDFBOX-5747.patch, invchar.pdf
>
>
> When extending PDFTextStripper, the writeString override receives a
> List<TextPosition>. When iterating over the list, each getUnicode() call
> should return the Unicode representation of the extracted text.
>
> However, for glyphs that require a surrogate pair (such as some mathematical
> symbols, e.g. 𝑋) and that are modified with a combining diacritic (such as ^),
> the extracted UTF-16 code units are out of order.
>
> The attached PDF contains 𝑋̂. This is composed of 𝑋, which is represented as
> the surrogate pair \uD835\uDC4B, and the combining diacritic \u0302.
>
> However, when extracted, we get \uD835\u0302\uDC4B (the combining diacritic
> is placed between the two code units of the surrogate pair). This is an
> invalid representation, and when encoded as JSON it will break most parsers.
> The expected output would be \uD835\uDC4B\u0302.
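
To make the reported ordering problem concrete, here is a minimal Java sketch that checks whether a string contains only well-formed surrogate pairs and, as a caller-side workaround, moves combining marks that landed between a high and a low surrogate to after the pair. This is not the attached PDFBOX-5747.patch and not PDFBox code; the class name SurrogateOrder and the methods isWellFormed, repair and isCombiningMark are hypothetical names made up for illustration, and only java.lang.Character calls are used.

```java
// Hypothetical helper for illustration only; not part of PDFBox.
final class SurrogateOrder {

    /** Returns true if every surrogate code unit in s belongs to a high/low pair. */
    static boolean isWellFormed(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false; // high surrogate without its low partner
                }
                i++; // skip the low surrogate just validated
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate without a preceding high one
            }
        }
        return true;
    }

    /** Assumption: treat Unicode categories Mn, Mc and Me as combining marks. */
    static boolean isCombiningMark(char c) {
        int type = Character.getType(c);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    /** Moves combining marks found between a high and a low surrogate to after the pair. */
    static String repair(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                int j = i + 1;
                while (j < s.length() && isCombiningMark(s.charAt(j))) {
                    j++; // scan over the misplaced marks
                }
                if (j > i + 1 && j < s.length() && Character.isLowSurrogate(s.charAt(j))) {
                    // Emit high + low first, then the marks that were in between.
                    out.append(c).append(s.charAt(j)).append(s, i + 1, j);
                    i = j + 1;
                    continue;
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String extracted = "\uD835\u0302\uDC4B"; // the sequence reported in the issue
        String repaired = repair(extracted);
        System.out.println(isWellFormed(extracted)); // false
        System.out.println(isWellFormed(repaired));  // true
    }
}
```

A check like isWellFormed could be applied to the concatenated getUnicode() values inside a writeString override before the text is serialized, for example to JSON. The repair step only papers over the symptom on the caller's side; where the ordering should actually be fixed inside the text stripper is for the patch and its review to settle.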