On Mon, 18 Aug 2025 16:05:24 GMT, Volker Simonis <simo...@openjdk.org> wrote:
> ### TL;DR
>
> This is a fix for what I think is a regression since the introduction of
> HarfBuzz in JDK 9. The problem is that the algorithm which converts the glyph
> vector produced by the layout engine into a corresponding character vector
> (in `ExtendedTextSourceLabel::createCharinfo()`) still assumes that "*each
> glyph maps to a single character*". But this is not true any more with
> HarfBuzz and, as this example demonstrates, can lead to improper clustering of
> characters which can result in bad line breaking decisions.
>
> I ran the corresponding JTreg and JCK tests on Linux but because this area is
> heavily dependent on the OS and concrete fonts I'd like to kindly ask you to
> run your internal test suites in this area if possible.
>
> In the following you can find a longer (maybe a bit too long :) description
> of this problem which I merely wrote for my own memory.
>
> ### Full description
>
> A customer reported a regression in JDK 9+ which leads to bad/wrong line
> breaks for text in the Khmer language. Khmer is a [complex
> script](https://en.wikipedia.org/wiki/Khmer_script) which was only added to
> the Unicode standard 3.0 in 1999 (in the [Unicode block
> U+1780..U+17FF](https://en.wikipedia.org/wiki/Khmer_(Unicode_block))) and I
> personally don't understand Khmer at all :)
>
> Fortunately, the customer could provide a [simple
> reproducer](https://bugs.openjdk.org/secure/attachment/115218/KhmerTest.java)
> which I could further condense to the following example:
> "បានស្នើសុំនៅតែត្រូវបានបដិសេធ" (according to Google Translate, this means
> "*Requested but still denied*").
> If we use OpenJDK's
> [`LineBreakMeasurer`](https://docs.oracle.com/en/java/javase/24/docs/api/java.desktop/java/awt/font/LineBreakMeasurer.html)
> to lay out that paragraph (notice that Khmer has no spaces between words) to
> fit within a specific "wrapping width", the output may look as follows with
> JDK 8 (the exact output depends on the font and the wrapping width):
>
>     Segment: បានស្នើសុំ 0 10
>     Segment: នៅតែត្រូវ 10 9
>     Segment: បានបដិសេ 19 8
>     Segment: ធ 27 1
>
> I ran with both the logical
> [DIALOG](https://docs.oracle.com/en/java/javase/24/docs/api/java.desktop/java/awt/Font.html#DIALOG)
> font and directly with
> `/usr/share/fonts/truetype/ttf-khmeros-core/KhmerOS.ttf` on Ubuntu 22.04 (on
> my system DIALOG will automatically fall back to the KhmerOS font for
> characters from the Khmer Unicode code block). I also tried with the [Noto
> Khmer](https://fonts.google.com/noto/specimen/Noto+Serif+Khmer) fonts but the
> results were similar, so I'...

As @mrserb correctly mentioned, there's now no need to count `clusterExtraGlyphs` any more, so I've removed it completely from the code.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26825#issuecomment-3200747024
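
For anyone who wants to print segment lines like the ones quoted above, here is a minimal sketch of a `LineBreakMeasurer` loop. It is not the attached `KhmerTest.java`. The class name `KhmerWrapSketch`, the helper `wrap`, the font size and the wrapping width of 100 are my own assumptions for illustration. The actual segments depend on the JDK version, the installed fonts and the width, so no expected output is given.

```java
import java.awt.Font;
import java.awt.font.FontRenderContext;
import java.awt.font.LineBreakMeasurer;
import java.awt.font.TextAttribute;
import java.text.AttributedString;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the JDK or of the PR.
final class KhmerWrapSketch {

    // Returns one {start, length} pair per laid-out line segment.
    static List<int[]> wrap(String text, Font font, float wrappingWidth) {
        AttributedString attributed = new AttributedString(text);
        attributed.addAttribute(TextAttribute.FONT, font);

        // Identity transform, no anti-aliasing, no fractional metrics.
        FontRenderContext frc = new FontRenderContext(null, false, false);
        LineBreakMeasurer measurer =
                new LineBreakMeasurer(attributed.getIterator(), frc);

        List<int[]> segments = new ArrayList<>();
        while (measurer.getPosition() < text.length()) {
            int start = measurer.getPosition();
            // Lays out as much text as fits and advances the position.
            measurer.nextLayout(wrappingWidth);
            segments.add(new int[] {start, measurer.getPosition() - start});
        }
        return segments;
    }

    public static void main(String[] args) {
        String text = "បានស្នើសុំនៅតែត្រូវបានបដិសេធ";
        // Assumed font and width; adjust to match your environment.
        Font font = new Font(Font.DIALOG, Font.PLAIN, 16);
        for (int[] seg : wrap(text, font, 100f)) {
            String part = text.substring(seg[0], seg[0] + seg[1]);
            System.out.println("Segment: " + part + " " + seg[0] + " " + seg[1]);
        }
    }
}
```

Comparing the printed start/length pairs between JDK 8 and a later JDK with the same font and width should show whether the line breaks differ on a given system.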