On Tue, 19 Aug 2025 14:41:54 GMT, Volker Simonis <[email protected]> wrote:

>> ### TL;DR
>> 
>> This is a fix for what I think is a regression since the introduction of 
>> HarfBuzz in JDK 9. The problem is that the algorithm which converts the 
>> glyph vector produced by the layout engine into a corresponding character 
>> vector (in `ExtendedTextSourceLabel::createCharinfo()`) still assumes that 
>> "*each glyph maps to a single character*". But this is no longer true 
>> with HarfBuzz and, as this example demonstrates, can lead to improper 
>> clustering of characters which can result in bad line breaking decisions.
>> 
>> I ran the corresponding JTreg and JCK test on Linux but because this area is 
>> heavily dependent on the OS and concrete fonts I'd like to kindly ask you to 
>> run your internal test suites in this area if possible.  
>> 
>> In the following you can find a longer (maybe a bit too long :) description 
>> of this problem which I merely wrote for my own memory.
>> 
>> ### Full description
>> 
>> A customer reported a regression in JDK 9+ which leads to bad/wrong line 
>> breaks for text in the Khmer language. Khmer is a [complex 
>> script](https://en.wikipedia.org/wiki/Khmer_script) which was only added to 
>> the Unicode standard 3.0 in 1999 (in the [Unicode block 
>> U+1780..U+17FF](https://en.wikipedia.org/wiki/Khmer_(Unicode_block))) and I 
>> personally don't understand Khmer at all :)
>> 
>> Fortunately, the customer could provide a [simple 
>> reproducer](https://bugs.openjdk.org/secure/attachment/115218/KhmerTest.java)
>>  which I could further condense to the following example: 
>> "បានស្នើសុំនៅតែត្រូវបានបដិសេធ" (according to Google translate, this means 
>> "*Requested but still denied*"). If we use OpenJDK's 
>> [`LineBreakMeasurer`](https://docs.oracle.com/en/java/javase/24/docs/api/java.desktop/java/awt/font/LineBreakMeasurer.html)
>>  to layout that paragraph (notice that Khmer has no spaces between words) to 
>> fit within a specific "wrapping width", the output may look as follows with 
>> JDK 8 (the exact output depends on the font and the wrapping width):
>> 
>> Segment: បានស្នើសុំ 0 10
>> Segment: នៅតែត្រូវ 10 9
>> Segment: បានបដិសេ 19 8
>> Segment: ធ 27 1
>> 
>> I ran with both, the logical 
>> [DIALOG](https://docs.oracle.com/en/java/javase/24/docs/api/java.desktop/java/awt/Font.html#DIALOG)
>>  font or directly with 
>> `/usr/share/fonts/truetype/ttf-khmeros-core/KhmerOS.ttf` on Ubuntu 22.04 (on 
>> my system DIALOG will automatically fall back to the KhmerOS font for 
>> characters from the Khmer Unicode code block). I also tried with the [Noto 
>> Khmer](https://fonts.google.com/noto/specimen/Noto+Serif+Khmer) f...
>
> Volker Simonis has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Added JTreg test to verify monotonically growing glyph character indices

> _Mailing list message from [Philip Race](mailto:[email protected]) on 
> [client-libs-dev](mailto:[email protected]):_
> 
> That's not relevant. This issue is about what you will contribute.
> 
> You should read and understand the OCA. If necessary you should consult your 
> organization's legal department for help.
> 
> -phil.
> 
> On 8/25/25 9:33 AM, Volker Simonis wrote:
> 
> > > > I can use 
> > > > [`hb-subset`](https://man.archlinux.org/man/extra/harfbuzz-utils/hb-subset.1.en)
> > > >  to create a subset of the [KhmerOS](https://www.cambodia.org/fonts/) 
> > > > open source font (licensed under LGPL 2.1 or later) which will be just 
> > > > enough for the test and check that in along with the test. The 
> > > > > subsetted font file will be 28 KB. Would that be acceptable?
> > > > No. That won't be allowed. You aren't using your own IP.
> > > > Sorry, but I don't understand the problem? We have a bunch of other 
> > > > third-party libraries which are included in OpenJDK along with their 
> > > > corresponding license files. Just do a `find . -name legal -type d 
> > > > > -exec echo {} \; -exec ls -la {} \;` in the top level directory to find 
> > > > them all.
> 

There is obviously a way to push third-party code with an appropriate license 
to OpenJDK. I understand the OCA and I'm not insisting on pushing such changes 
myself. I just offered to create the corresponding PR so that you or somebody 
else can push it (just as you've pushed the DejaVu fonts).

But I don't want to get into a licensing discussion here, nor do I want to 
solve the general problem of testing complex script layout in OpenJDK.

I think it is evident that this PR fixes a regression that has been in OpenJDK 
since JDK 9. This regression can probably affect all complex scripts which do 
character reordering and use ligatures. I think one of the reasons why it 
became apparent in Khmer is that Khmer script does not use spaces between 
words. This means that in OpenJDK we use the default RuleBasedBreakIterator 
for finding word boundaries because we have no dictionary support for Khmer.

As a consequence, we can break at any cluster boundary (and in Khmer **only** 
at cluster boundaries because there's no white space between words), and 
cluster boundaries have been broken since JDK 9 because of the missing 
invisible glyphs. 
[ExtendedTextSourceLabel::getLineBreakIndex()](https://github.com/openjdk/jdk/blob/040cc7aee03e82e70bcbfcd2dde5cd4b35faeabd/src/java.desktop/share/classes/sun/font/ExtendedTextSourceLabel.java#L483)
 simply considers all the characters with zero advance to belong to a cluster 
and it won't break in the middle of a cluster. But because of the regression 
introduced by the HarfBuzz integration, we can get arbitrarily long clusters 
which won't be broken.
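
To illustrate the effect, here is a minimal sketch of a width-based break search 
over per-character advances. It is a simplified model and not the actual 
`ExtendedTextSourceLabel` code: the class name `BreakIndexSketch`, the method 
signature and the advance values are all made up for illustration. The only 
point is that a character with zero advance never consumes width, so it always 
stays on the same line as the character before it.

```java
// Simplified, hypothetical model of a width-based line break search.
// Not the JDK implementation; names and numbers are assumptions.
class BreakIndexSketch {

    // Returns the index of the first character that no longer fits into
    // 'width' when laying out 'advances' starting at 'start'.
    static int lineBreakIndex(float[] advances, int start, float width) {
        int i = start - 1;
        while (width >= 0 && ++i < advances.length) {
            // Zero-advance characters leave 'width' unchanged, so the
            // search can never stop between them and their base character.
            width -= advances[i];
        }
        return i;
    }

    public static void main(String[] args) {
        // Three clusters, each a base character plus one zero-advance mark.
        float[] healthy = {10, 0, 10, 0, 10, 0};
        // Same text, but only the first character got a non-zero advance,
        // so everything collapses into a single long "cluster".
        float[] degraded = {10, 0, 0, 0, 0, 0};

        System.out.println(lineBreakIndex(healthy, 0, 25f));  // prints 4
        System.out.println(lineBreakIndex(degraded, 0, 25f)); // prints 6
    }
}
```

With the healthy advances the line is broken after the second cluster (index 
4). With the degraded advances the whole run is returned as one unbreakable 
unit (index 6), no matter how long it is.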

This change is pretty simple. I don't think it does any harm and at least it 
contains a regression test which verifies the monotonic nature of cluster 
indices (which hasn't been tested until now). Please let us first push this 
simple fix before we try to achieve more ambitious goals.
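
The property itself is easy to state in code. The following is only a sketch of 
the kind of check meant here, not the JTreg test from the PR: the class name 
`MonotonicIndicesSketch` and the sample arrays are hypothetical, how the 
indices are obtained from the layout is left out, and I assume that 
"monotonic" means non-decreasing (several glyphs may share one character 
index).

```java
// Hypothetical sketch of a monotonicity check on cluster/character indices.
// It does not reproduce the regression test of the PR.
class MonotonicIndicesSketch {

    // Returns true if no index is smaller than its predecessor.
    static boolean isNonDecreasing(int[] charIndices) {
        for (int i = 1; i < charIndices.length; i++) {
            if (charIndices[i] < charIndices[i - 1]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up sample data: two glyphs share character index 1.
        System.out.println(isNonDecreasing(new int[] {0, 1, 1, 3, 4})); // prints true
        // Made-up sample data: index 1 appears after index 3.
        System.out.println(isNonDecreasing(new int[] {0, 3, 1, 4}));    // prints false
    }
}
```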

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26825#issuecomment-3221417422
