[jira] [Commented] (GROOVY-12085) SourceText slices UTF-16 with code-point AST columns, truncating source for supplementary characters

Paul King (Jira) Fri, 12 Jun 2026 18:07:09 -0700


    [ 
https://issues.apache.org/jira/browse/GROOVY-12085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18088656#comment-18088656
 ]


Paul King commented on GROOVY-12085:
------------------------------------

I have marked this as breaking. It is really a consistency fix, but I am 
flagging it so that anyone making heavy use of supplementary characters can do 
extra checking.

h4. TL;DR

For source that contains *no* supplementary (astral-plane) characters, the fix 
is provably a *no-op* — nothing changes. Any breaking potential is confined to 
(a) source that actually contains supplementary characters *and* (b) tooling 
that consumed the old, inconsistent {{lastColumnNumber}}.

h4. Could it break code with no supplementary characters?

*No — and this is guaranteed, not just likely.* Both changes reduce to the 
identity when there are no surrogate pairs:
* {{s.codePointCount(0, s.length()) == s.length()}} for any string {{s}} with 
no surrogate pairs.
* {{s.offsetByCodePoints(0, n) == n}} when no surrogate pair precedes index 
{{n}}.

Only a *valid supplementary character* (a surrogate pair) ever changes a value; 
even a lone/malformed surrogate counts as 1 under both {{length()}} and 
{{codePointCount()}}. So ASCII/BMP source — effectively all existing code — 
produces *bit-for-bit identical* AST column numbers and power-assert output. 
The {{PositionConfigureUtils}} change touches every node's 
{{lastColumnNumber}}, but for BMP source that value is unchanged.
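
The identities above can be checked directly against {{java.lang.String}}. The following is a minimal sketch, not code from the patch: the class name {{SurrogateIdentityDemo}} and the sample strings are made up for illustration.

```java
// Sketch only: checks the two identities on hand-built strings.
class SurrogateIdentityDemo {
    // U+1F964, built from its code point to avoid source-encoding issues.
    static final String CUP = new String(Character.toChars(0x1F964));

    // BMP-only sample (includes one non-ASCII BMP character, U+00E9).
    static final String BMP = "assert caf" + (char) 0x00E9 + " == 1";
    // One valid surrogate pair between two ASCII letters.
    static final String ASTRAL = "a" + CUP + "b";
    // An unpaired high surrogate between two ASCII letters.
    static final String LONE = "a" + (char) 0xD83E + "b";

    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // BMP source: both conversions are the identity.
        System.out.println(codePoints(BMP) == BMP.length());      // true
        System.out.println(BMP.offsetByCodePoints(0, 5) == 5);    // true

        // A valid surrogate pair is the only thing that changes a value.
        System.out.println(ASTRAL.length());                      // 4
        System.out.println(codePoints(ASTRAL));                   // 3
        System.out.println(ASTRAL.offsetByCodePoints(0, 2));      // 3

        // A lone surrogate still counts as 1 under both measures.
        System.out.println(LONE.length() == codePoints(LONE));    // true
    }
}
```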

h4. Scope of each change

* *{{SourceText}}* — local to power-assert and anything reusing {{SourceText}}. 
Identity for BMP; for astral source it now captures the *complete* line instead 
of a truncated one.
* *{{PositionConfigureUtils}}* — global: it changes {{lastColumnNumber}} for 
*every* node. Identity for BMP; for astral source the end column is now 
code-point based.
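
To make the {{PositionConfigureUtils}} difference concrete, here is a rough sketch of the two end-column formulas quoted in the issue description. It is not the actual Groovy source: plain {{int}}/{{String}} parameters stand in for the ANTLR token, and the names {{EndColumnDemo}}, {{hybridLastColumn}} and {{codePointLastColumn}} are hypothetical.

```java
// Sketch only: compares the old hybrid end column with the code-point one.
class EndColumnDemo {
    static String cp(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // Old formula: code-point start plus UTF-16 length of the token text.
    static int hybridLastColumn(int charPositionInLine, String tokenText) {
        return charPositionInLine + 1 + tokenText.length();
    }

    // Suggested formula: code-point start plus code-point length.
    static int codePointLastColumn(int charPositionInLine, String tokenText) {
        return charPositionInLine + 1
                + tokenText.codePointCount(0, tokenText.length());
    }

    public static void main(String[] args) {
        // BMP token: a quoted 'z' starting at 0-based position 22.
        String bmpToken = "'z'";
        System.out.println(hybridLastColumn(22, bmpToken));       // 26
        System.out.println(codePointLastColumn(22, bmpToken));    // 26

        // Astral token: two quoted emoji (U+1F964, U+1F41D) at position 15.
        String astralToken = "'" + cp(0x1F964) + cp(0x1F41D) + "'";
        System.out.println(hybridLastColumn(15, astralToken));    // 22
        System.out.println(codePointLastColumn(15, astralToken)); // 20
    }
}
```

The two formulas agree for the BMP token and differ by exactly the number of supplementary characters in the astral one.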

h4. How it could break (only when supplementary characters are present)

* *Tools that treat {{lastColumnNumber}} as a UTF-16 offset* (IDE/editor 
integrations mapping AST positions to UTF-16 document offsets, source-snippet 
extractors doing {{line.substring(col-1, lastCol-1)}}). Previously the end 
column counted a supplementary char inside the final token as 2 (UTF-16 units); 
now it counts it as 1 (code point). Such a consumer will now land *earlier* by 
the number of supplementary chars in the final token. Note this only aligns the 
*end* with the *start*: {{columnNumber}} was already code-point based, so these 
tools were already off on the start side for astral source — the change makes 
both ends consistent rather than introducing a new basis.
* *Snapshot/golden tests* that pin exact column numbers, syntax-error caret 
positions, or power-assert message text for source containing emoji. The 
power-assert message in particular changes from truncated to complete, so any 
test asserting the old (buggy) output needs updating.
* *Downstream AST consumers* (linters, contract/spec frameworks, code-coverage 
source mapping) that derived ranges from {{lastColumnNumber}} over astral 
source — they shift from the old hybrid value to a consistent code-point value.
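
For consumers that slice source text from AST columns, the adjustment might look something like the sketch below. It assumes 1-based code-point columns with an exclusive end column, and uses the reproduction line from the issue with hand-computed columns (8 and 43). The names {{SliceDemo}}, {{naiveSlice}} and {{codePointSlice}} are hypothetical and not part of Groovy.

```java
// Sketch only: slicing a UTF-16 String with code-point columns.
class SliceDemo {
    static final String CUP = new String(Character.toChars(0x1F964));
    static final String BEE = new String(Character.toChars(0x1F41D));

    // The line from the issue's reproduction: 42 code points, 44 UTF-16 units.
    static final String LINE =
            "assert (true ? '" + CUP + BEE + "' : 'z').length() == 999";

    // Treats the columns as UTF-16 offsets (wrong for astral source).
    static String naiveSlice(String line, int col, int lastCol) {
        return line.substring(col - 1, lastCol - 1);
    }

    // Converts code-point columns to UTF-16 indices before slicing.
    static String codePointSlice(String line, int col, int lastCol) {
        int start = line.offsetByCodePoints(0, col - 1);
        int end = line.offsetByCodePoints(0, lastCol - 1);
        return line.substring(start, end);
    }

    public static void main(String[] args) {
        // Expression spans code-point columns 8 (inclusive) to 43 (exclusive).
        System.out.println(naiveSlice(LINE, 8, 43));      // ends with "== 9"
        System.out.println(codePointSlice(LINE, 8, 43));  // ends with "== 999"
    }
}
```

The naive slice loses one UTF-16 unit per supplementary character before the end boundary, which matches the truncation described in the issue. For BMP-only lines the two methods return the same text.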

h4. Net characterization

This is best framed as a *consistency fix*: {{columnNumber}} (start) was 
already code-point based; {{lastColumnNumber}} (end) now matches. A consumer 
that correctly converts code-point columns to UTF-16 indices *benefits*; a 
consumer that relied on the old hybrid end as an accidental approximation could 
*regress* — but in both cases *only for source containing supplementary 
characters*. There is no behavioral change for code without them.

> SourceText slices UTF-16 with code-point AST columns, truncating source for 
> supplementary characters
> ----------------------------------------------------------------------------------------------------
>
>                 Key: GROOVY-12085
>                 URL: https://issues.apache.org/jira/browse/GROOVY-12085
>             Project: Groovy
>          Issue Type: Bug
>            Reporter: Paul King
>            Assignee: Paul King
>            Priority: Major
>              Labels: breaking
>
> Power-assert renders truncated source text for any assertion containing 
> supplementary (astral-plane) characters such as emoji. Anything reusing 
> {{SourceText}} (e.g. groovy-contracts/groovy-verify source capture) is 
> affected, and the same column mismatch mis-positions syntax-error carets.
> h4. Steps to reproduce
> {code:groovy}
> assert (true ? '🥤🐝' : 'z').length() == 999
> {code}
> The rendered expression is cut short by one character per emoji preceding the 
> end (e.g. {{== 999}} renders as {{== 9}}).
> h4. Root cause
> AST column numbers are *code-point*-based ({{GroovyLangLexer}} uses 
> {{CharStreams.fromReader}} -> ANTLR {{CodePointCharStream}}; 
> {{PositionConfigureUtils}} sets {{columnNumber = getCharPositionInLine() + 
> 1}}), but {{SourceText}} slices a *UTF-16* {{String}} from 
> {{SourceUnit.getSample()}} with {{substring()}} using those columns. Each 
> astral char before the slice boundary under-counts the UTF-16 index by one, 
> cutting the slice short.
> Compounding this, {{PositionConfigureUtils}} computes {{lastColumnNumber = 
> getCharPositionInLine() + 1 + token.getText().length()}} -- a code-point 
> start plus a UTF-16 length. The UTF-16 term accidentally compensates for 
> astral chars *inside* the final token (so "emoji as last token" works) but 
> not for those *before* it, which is why the bug looks intermittent.
> h4. Suggested fix
> Make column numbers uniformly code-point-based 
> ({{token.getText().codePointCount(0, len)}} in {{PositionConfigureUtils}}), 
> then convert code-point columns to UTF-16 indices at the slice sites via 
> {{String.offsetByCodePoints}} ({{SourceText}}, and the syntax-error message 
> renderers). Note: converting the hybrid {{lastColumnNumber}} without the 
> {{codePointCount}} fix first will overshoot/throw, so both changes are needed 
> together.
> h4. Affects
> All current versions (present on master). Affects power-assert rendering and 
> any consumer of {{SourceText}}/AST column positions for supplementary 
> characters.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
