[ https://issues.apache.org/jira/browse/PDFBOX-5747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17921205#comment-17921205 ]
Richard Eckart de Castilho commented on PDFBOX-5747:
----------------------------------------------------

[~tilman] Thanks for jumping on it right away :) I'm quite sure this fix will make PDFBox quite a bit more useful for processing scientific papers.

> Surrogate pairs with combining diacritics are incorrectly ordered on text extraction
> ------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5747
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5747
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.30, 2.0.33, 3.0.4 PDFBox
>            Reporter: P Crossa
>            Assignee: Tilman Hausherr
>            Priority: Major
>             Fix For: 2.0.34, 3.0.5 PDFBox, 4.0.0
>
>         Attachments: PDFBOX-5747-unicode-surrogate-with-diacritic-reduced.pdf, PDFBOX-5747.patch, invchar.pdf
>
>
> When extending {{PDFTextStripper}}, the {{writeString}} override receives a
> {{List<TextPosition>}}. When iterating over them, the {{getUnicode()}} call
> should return the Unicode representation of the extracted text.
> However, for glyphs that require a surrogate pair (such as some mathematical
> symbols, e.g. 𝑋) that are modified with a combining diacritic (such as ^),
> the extracted Unicode characters are out of order.
> The attached PDF contains 𝑋̂. This is composed of 𝑋, which is represented as
> the surrogate pair \uD835\uDC4B, and the combining diacritic \u0302.
> However, when extracted, we get \uD835\u0302\uDC4B (the combining diacritic
> is placed in between the two characters of the surrogate pair). This is an
> invalid representation, and when encoded as JSON it will break most parsers.
> The expected output would be \uD835\uDC4B\u0302
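For anyone stuck on an affected version, here is a minimal sketch of how the symptom described above could be detected and repaired in post-processing. It uses only JDK classes. The class and method names (`SurrogateOrderCheck`, `isWellFormedUtf16`, `repairSplitPairs`) are made up for illustration. This is not the change in PDFBOX-5747.patch, and it assumes the only defect is one or more combining marks wedged between the two halves of a surrogate pair.

```java
final class SurrogateOrderCheck {

    /** Returns true if every surrogate in s belongs to a well-formed high+low pair. */
    static boolean isWellFormedUtf16(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false; // high surrogate without a following low surrogate
                }
                i++; // skip the low surrogate we just validated
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate without a preceding high surrogate
            }
        }
        return true;
    }

    /** True for code units in the Unicode general categories Mn, Me and Mc. */
    static boolean isCombiningMark(char c) {
        int type = Character.getType(c);
        return type == Character.NON_SPACING_MARK
                || type == Character.ENCLOSING_MARK
                || type == Character.COMBINING_SPACING_MARK;
    }

    /**
     * Hypothetical workaround: rewrites "high, mark(s), low" as "high, low, mark(s)".
     * Any other sequence is copied through unchanged.
     */
    static String repairSplitPairs(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                int j = i + 1;
                // Skip over combining marks that directly follow the high surrogate.
                while (j < s.length() && isCombiningMark(s.charAt(j))) {
                    j++;
                }
                if (j > i + 1 && j < s.length() && Character.isLowSurrogate(s.charAt(j))) {
                    // Re-join the pair first, then emit the displaced marks.
                    out.append(c).append(s.charAt(j)).append(s, i + 1, j);
                    i = j + 1;
                    continue;
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String extracted = "\uD835\u0302\uDC4B"; // order reported in the issue
        String repaired = repairSplitPairs(extracted);
        System.out.println(isWellFormedUtf16(extracted));            // prints false
        System.out.println(isWellFormedUtf16(repaired));             // prints true
        System.out.println(repaired.equals("\uD835\uDC4B\u0302"));   // prints true
    }
}
```

In a `writeString` override, the same repair would presumably be applied to the concatenated `getUnicode()` values before serializing to JSON. Upgrading to a release listed under "Fix For" is the better option where possible.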