[jira] [Updated] (GROOVY-12085) SourceText slices UTF-16 with code-point AST columns, truncating source for supplementary characters

Paul King (Jira) Fri, 12 Jun 2026 17:47:47 -0700


     [ 
https://issues.apache.org/jira/browse/GROOVY-12085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Paul King updated GROOVY-12085:
-------------------------------
    Description: 
Power-assert renders truncated source text for any assertion containing 
supplementary (astral-plane) characters such as emoji. Anything reusing 
{{SourceText}} (e.g. groovy-contracts/groovy-verify source capture) is 
affected, and the same column mismatch mis-positions syntax-error carets.

h4. Steps to reproduce

{code:groovy}
assert (true ? '🥤🐝' : 'z').length() == 999
{code}

The rendered expression is cut short by one UTF-16 code unit for each 
supplementary character that precedes its end. With the two emoji above, 
{{== 999}} renders as {{== 9}}.

h4. Root cause

AST column numbers are *code-point*-based ({{GroovyLangLexer}} uses 
{{CharStreams.fromReader}} -> ANTLR {{CodePointCharStream}}; 
{{PositionConfigureUtils}} sets {{columnNumber = getCharPositionInLine() + 
1}}), but {{SourceText}} slices a *UTF-16* {{String}} from 
{{SourceUnit.getSample()}} with {{substring()}} using those columns. Each 
astral char before the slice boundary under-counts the UTF-16 index by one, 
cutting the slice short.

Compounding this, {{PositionConfigureUtils}} computes {{lastColumnNumber = 
getCharPositionInLine() + 1 + token.getText().length()}} -- a code-point start 
plus a UTF-16 length. The UTF-16 term accidentally compensates for astral chars 
*inside* the final token (so "emoji as last token" works) but not for those 
*before* it, which is why the bug looks intermittent.
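
To illustrate why the hybrid formula only sometimes works, here is a small sketch that evaluates it for two lines. The class and method names are hypothetical, and {{charPositionInLine}} is simulated with {{codePointCount}} on the assumption (taken from the description above) that the lexer reports code-point positions.

```java
final class HybridLastColumnDemo {
    // Mirrors the formula quoted above: code-point start + 1 + UTF-16 length.
    static int hybridLastColumn(String line, int tokenStartUtf16, String tokenText) {
        // Simulated lexer position: number of code points before the token.
        int charPositionInLine = line.codePointCount(0, tokenStartUtf16);
        return charPositionInLine + 1 + tokenText.length();
    }

    // The 1-based exclusive end column that substring() on a UTF-16 String needs.
    static int utf16LastColumn(int tokenStartUtf16, String tokenText) {
        return tokenStartUtf16 + 1 + tokenText.length();
    }

    static void report(String line, String token) {
        int start = line.indexOf(token);
        System.out.println("hybrid=" + hybridLastColumn(line, start, token)
                + " utf16=" + utf16LastColumn(start, token));
    }

    public static void main(String[] args) {
        // Emoji inside the final token: the UTF-16 length term makes up the difference.
        report("assert x == '\uD83E\uDD64'", "'\uD83E\uDD64'");
        // Emoji before the final token: the code-point start is one short.
        report("assert '\uD83E\uDD64' == 999", "999");
    }
}
```

In the first call the two values agree; in the second the hybrid column is one less than the UTF-16 column, matching the one-unit-per-emoji truncation.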

h4. Suggested fix

Make column numbers uniformly code-point-based 
({{token.getText().codePointCount(0, len)}} in {{PositionConfigureUtils}}, 
where {{len}} is {{token.getText().length()}}), then convert code-point 
columns to UTF-16 indices at the slice sites via 
{{String.offsetByCodePoints}} ({{SourceText}} and the syntax-error message 
renderers). Note: converting the hybrid {{lastColumnNumber}} without the 
{{codePointCount}} fix first will overshoot the token end, or throw when the 
overshoot runs past the end of the line, so both changes are needed together.
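
The two-part fix can be sketched with standard {{String}} methods. This is an illustration under stated assumptions rather than the proposed patch: {{CodePointSliceSketch}}, {{lastColumn}}, {{toUtf16Index}} and {{slice}} are hypothetical names, and columns are assumed to be 1-based code-point columns with an exclusive end.

```java
final class CodePointSliceSketch {
    // Part 1: a uniformly code-point-based last column
    // (code-point start + 1 + code-point length of the token text).
    static int lastColumn(int charPositionInLine, String tokenText) {
        return charPositionInLine + 1 + tokenText.codePointCount(0, tokenText.length());
    }

    // Part 2: convert a 1-based code-point column to a 0-based UTF-16 index.
    // Throws IndexOutOfBoundsException if the column lies beyond the line.
    static int toUtf16Index(String line, int column) {
        return line.offsetByCodePoints(0, column - 1);
    }

    // Slice using code-point columns, converting both ends before substring().
    static String slice(String line, int startColumn, int endColumnExclusive) {
        return line.substring(toUtf16Index(line, startColumn),
                              toUtf16Index(line, endColumnExclusive));
    }

    public static void main(String[] args) {
        String line = "assert (true ? '\uD83E\uDD64\uD83D\uDC1D' : 'z').length() == 999";
        // Code-point position of the final token, as a code-point lexer would report it.
        int charPositionInLine = line.codePointCount(0, line.indexOf("999"));
        int end = lastColumn(charPositionInLine, "999");
        System.out.println(slice(line, 8, end));
    }
}
```

With both parts in place the slice keeps the full {{== 999}}. Feeding a hybrid last column into {{toUtf16Index}} for a line whose final token contains an emoji asks for more code points than the line has, which is the throwing case noted above.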

h4. Affects

All current versions (present on master). Affects power-assert rendering and 
any consumer of {{SourceText}}/AST column positions for supplementary 
characters.

> SourceText slices UTF-16 with code-point AST columns, truncating source for 
> supplementary characters
> ----------------------------------------------------------------------------------------------------
>
>                 Key: GROOVY-12085
>                 URL: https://issues.apache.org/jira/browse/GROOVY-12085
>             Project: Groovy
>          Issue Type: Bug
>            Reporter: Paul King
>            Assignee: Paul King
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
