On Tue, 19 Aug 2025 14:41:54 GMT, Volker Simonis <simo...@openjdk.org> wrote:
>> ### TL;DR
>>
>> This is a fix for what I think is a regression since the introduction of
>> HarfBuzz in JDK 9. The problem is that the algorithm which converts the
>> glyph vector produced by the layout engine into a corresponding character
>> vector (in `ExtendedTextSourceLabel::createCharinfo()`) still assumes that
>> "*each glyph maps to a single character*". But this is not true any more
>> with HarfBuzz and as this example demonstrates, can lead to improper
>> clustering of characters which can result to bad line breaking decisions.
>>
>> I ran the corresponding JTreg and JCK test on Linux but because this area is
>> heavily dependent on the OS and concrete fonts I'd like to kindly ask you to
>> run your internal test suites in this area if possible.
>>
>> In the following you can find a longer (maybe a bit too long :) description
>> of this problem which I merely wrote for my own memory.
>>
>> ### Full description
>>
>> A customer reported a regression in JDK 9+ which leads to bad/wrong line
>> breaks for text in the Khmer language. Khmer is a [complex
>> script](https://en.wikipedia.org/wiki/Khmer_script) which was only added to
>> the Unicode standard 3.0 in 1999 (in the [Unicode block
>> U+1780..U+17FF](https://en.wikipedia.org/wiki/Khmer_(Unicode_block))) and I
>> personally don't understand Khmer at all :)
>>
>> Fortunately, the customer could provide a [simple
>> reproducer](https://bugs.openjdk.org/secure/attachment/115218/KhmerTest.java)
>> which I could further condense to the following example:
>> "បានស្នើសុំនៅតែត្រូវបានបដិសេធ" (according to Google translate, this means
>> "*Requested but still denied*").
>> If we use OpenJDK's
>> [`LineBreakMeasurer`](https://docs.oracle.com/en/java/javase/24/docs/api/java.desktop/java/awt/font/LineBreakMeasurer.html)
>> to layout that paragraph (notice that Khmer has no spaces between words) to
>> fit within a specific "wrapping width", the output may look as follows with
>> JDK 8 (the exact output depends on the font and the wrapping width):
>>
>> Segment: បានស្នើសុំ 0 10
>> Segment: នៅតែត្រូវ 10 9
>> Segment: បានបដិសេ 19 8
>> Segment: ធ 27 1
>>
>> I ran with both, the logical
>> [DIALOG](https://docs.oracle.com/en/java/javase/24/docs/api/java.desktop/java/awt/Font.html#DIALOG)
>> font or directly with
>> `/usr/share/fonts/truetype/ttf-khmeros-core/KhmerOS.ttf` on Ubuntu 22.04 (on
>> my system DIALOG will automatically fall back to the KhmerOS font for
>> characters from the Khmer Unicode code block). I also tried with the [Noto
>> Khmer](https://fonts.google.com/noto/specimen/Noto+Serif+Khmer) f...
>
> Volker Simonis has updated the pull request incrementally with one additional
> commit since the last revision:
>
> Added JTreg test to verify monotonically growing glyph character indices

Great write-up. I haven't followed all of the rabbit trails linked, but I think I understand the overall issue.

Regarding testing: my understanding is that including any open source fonts in the tests requires a long licensing approval process and is best avoided. An alternative is to script the creation of a custom test font (see e.g. `GlyphVectorGsubTest`), but it might be tricky to get right in this case, where we're dealing with an unfamiliar language and so many substitutions.

One idea for testing would be to find all physical fonts which support Khmer at runtime (see e.g. `FormatCharAdvanceTest.getPhysicalFont(int)`, but with a `canDisplayUpTo` check for the Khmer chars). Then, for each supporting font, run the `LineBreakMeasurer` over a range of text widths (e.g. 100 to 500 pixels, in 10-pixel increments) and make sure that, regardless of text width, the measurer never leaves a single character on any line (except the last line, which would be OK).

I know it's a bit hand-wavy and quite black-box-ish, but it might be a good way to verify that the user-visible misbehavior we're trying to fix is gone and doesn't resurface.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26825#issuecomment-3215320748
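
For illustration only, the black-box check suggested in the comment above might be sketched roughly as follows. This is not the test from the PR: the class name `KhmerLineBreakSketch`, the helpers `lineEnds` and `countSingleCharLines`, the 24-point font size, and the way logical fonts are filtered out by family name are all assumptions made for this sketch. It uses only public `java.awt` APIs (`GraphicsEnvironment.getAllFonts`, `Font.canDisplayUpTo`, `LineBreakMeasurer.nextLayout`), and the sample string is the one from the quoted description.

```java
import java.awt.Font;
import java.awt.GraphicsEnvironment;
import java.awt.font.FontRenderContext;
import java.awt.font.LineBreakMeasurer;
import java.awt.font.TextAttribute;
import java.text.AttributedString;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical class name; not part of the JDK test suite.
class KhmerLineBreakSketch {

    // Sample string taken from the quoted PR description.
    static final String KHMER = "បានស្នើសុំនៅតែត្រូវបានបដិសេធ";

    // Logical font families, skipped so that only physical fonts are exercised
    // (an assumption of this sketch; the real test may select fonts differently).
    static final Set<String> LOGICAL = Set.of(Font.DIALOG, Font.DIALOG_INPUT,
            Font.SANS_SERIF, Font.SERIF, Font.MONOSPACED);

    /** Lays out the text at the given wrapping width and returns each line's end offset. */
    static List<Integer> lineEnds(String text, Font font, float width) {
        AttributedString as = new AttributedString(text);
        as.addAttribute(TextAttribute.FONT, font);
        FontRenderContext frc = new FontRenderContext(null, false, false);
        LineBreakMeasurer lbm = new LineBreakMeasurer(as.getIterator(), frc);
        List<Integer> ends = new ArrayList<>();
        while (lbm.getPosition() < text.length()) {
            lbm.nextLayout(width);        // advances the measurer past one line
            ends.add(lbm.getPosition());  // end offset of the line just laid out
        }
        return ends;
    }

    /** Counts lines, other than the last one, that contain exactly one char. */
    static int countSingleCharLines(List<Integer> lineEnds) {
        int count = 0;
        int start = 0;
        for (int i = 0; i < lineEnds.size() - 1; i++) {
            if (lineEnds.get(i) - start == 1) {
                count++;
            }
            start = lineEnds.get(i);
        }
        return count;
    }

    public static void main(String[] args) {
        GraphicsEnvironment ge = GraphicsEnvironment.getLocalGraphicsEnvironment();
        for (Font f : ge.getAllFonts()) {
            // Keep only non-logical fonts that can display every char of the sample.
            if (LOGICAL.contains(f.getFamily()) || f.canDisplayUpTo(KHMER) != -1) {
                continue;
            }
            Font font = f.deriveFont(24f); // getAllFonts() returns 1-point fonts
            for (int width = 100; width <= 500; width += 10) {
                int bad = countSingleCharLines(lineEnds(KHMER, font, width));
                if (bad > 0) {
                    System.out.println(f.getFontName() + " @ " + width
                            + "px: " + bad + " single-char line(s)");
                }
            }
        }
    }
}
```

With the JDK 8 segments quoted above (line ends 10, 19, 27, 28), `countSingleCharLines` yields 0, because the only one-character line is the last one. On a machine with no Khmer-capable font installed, `main` simply checks nothing, so a real test would probably need to report that case explicitly.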