On Tue, Feb 24, 2026 at 11:38, Kentaro Takeda <[email protected]> wrote:
>
> Hi Yuya,
>
> I think this is a good idea. While spec compliance is generally desirable, 
> DoS via unbounded grapheme clusters is a real threat, and it's reasonable for 
> a language-level implementation to impose practical limits that the Unicode 
> spec itself doesn't define. This kind of gap between a general-purpose spec 
> and a concrete implementation is not unusual.
>
> The default of 32 code points sounds sensible given that natural language 
> grapheme clusters top out well below that.
>
> One minor note: it might help to clarify the intended behavior of 
> `grapheme_limit_codepoints` a bit more — for instance, whether it is meant as 
> a validation check (returning false when a cluster exceeds the limit) or 
> something else.
>
> Regards,
> Kentaro Takeda
>
>
> On Mon, Feb 23, 2026 at 20:28, youkidearitai <[email protected]> wrote:
>>
>> Hi, Internals
>>
>> I noticed that UAX #29 does not limit the number of code points in a
>> grapheme cluster.
>> https://www.unicode.org/reports/tr29/
>>
>> An ICU issue also confirmed that Unicode defines no such limit.
>> https://unicode-org.atlassian.net/browse/ICU-23302
>>
>> This means a single grapheme cluster can contain an arbitrary number of
>> code points, which can crash a program because computer resources are
>> limited.
>>
>> For example, this command writes emoji_bomb.txt, a 200MB file that
>> contains only 1 grapheme cluster:
>> ```
>> php -r 'echo(mb_trim(str_repeat("\u{200d}\u{1f468}\u{200d}\u{1f466}\u{200d}\u{1f466}", 10000000), "\u{200d}"));' \
>>   -d memory_limit=600M > emoji_bomb.txt
>> ```
>> (PLEASE BE CAREFUL WHEN OPENING emoji_bomb.txt BECAUSE IT MAY CAUSE A CRASH)
>>
>> So, I think we (php-src, at the programming language level) need to add
>> a new function that enforces a custom limit.
>> My idea is below:
>>
>> ```
>> grapheme_limit_codepoints(string $str, int $max_codepoints = 32): bool
>> ```
>>
>> I don't have a strong opinion on whether $max_codepoints should be 32.
>> However, 32 code points should be enough for a grapheme cluster, because
>> the longest one in a human language is probably Hakṣhmalawarayaṁ(ཧ), at
>> 9 code points.
>>
>> If more code points per grapheme cluster are needed,
>> userland can increase $max_codepoints.
>>
>> Please also see my slides on Speaker Deck.
>> https://speakerdeck.com/youkidearitai/limit-of-code-point-for-grapheme-cluster
>>
>> What do you think about this idea?
>>
>> Regards
>> Yuya
>>
>> --
>> ---------------------------
>> Yuya Hamada (tekimen)
>> - https://tekitoh-memdhoi.info
>> - https://github.com/youkidearitai
>> -----------------------------

Hi, Kentaro

Thank you very much for your feedback.

> One minor note: it might help to clarify the intended behavior of 
> `grapheme_limit_codepoints` a bit more — for instance, whether it is meant as 
> a validation check (returning false when a cluster exceeds the limit) or 
> something else.

Okay, here is an example.

```
// Some string in $_POST['text'].
// Reject input with too many code points in a single grapheme cluster.
if (grapheme_limit_codepoints($_POST['text'], 32) !== true) {
    throw new InvalidArgumentException("Found invalid / too many code points in a grapheme cluster");
}

// Validate the length in grapheme clusters.
if (grapheme_strlen($_POST['text']) > 100) {
    throw new InvalidArgumentException("Invalid: more than 100 graphemes");
}

// do something...
```
The intention is to count graphemes correctly while avoiding DoS.
I also want to overcome the problem raised in
https://github.com/symfony/symfony/pull/13527 for the grapheme_strlen
function.
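
To make the validation semantics concrete, here is a minimal userland
sketch. It is not the proposed implementation: the name
grapheme_limit_codepoints_sketch is hypothetical, and it assumes PCRE's \X
is an acceptable stand-in for ICU's grapheme segmentation, which may differ
in edge cases. It also buffers every cluster in memory, so it only
illustrates the return values and is not itself DoS-safe.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical approximation of the proposed grapheme_limit_codepoints().
 * Returns true when every grapheme cluster in $str has at most
 * $max_codepoints code points; returns false otherwise, including when
 * $str is not valid UTF-8.
 */
function grapheme_limit_codepoints_sketch(string $str, int $max_codepoints = 32): bool
{
    if ($str === '') {
        return true; // nothing to check
    }

    // \X matches one extended grapheme cluster; /u enables UTF-8 mode.
    $count = preg_match_all('/\X/u', $str, $matches);
    if ($count === false) {
        return false; // invalid UTF-8 or another PCRE error
    }

    foreach ($matches[0] as $cluster) {
        // With /s and /u, "." matches exactly one code point, newlines included.
        if (preg_match_all('/./su', $cluster) > $max_codepoints) {
            return false;
        }
    }

    return true;
}
```

With this sketch, a base letter followed by 40 combining marks is rejected
under the default limit of 32 and accepted once the caller raises the limit,
which is the behavior I have in mind for userland.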

Feel free to comment further.
Regards
Yuya.

-- 
---------------------------
Yuya Hamada (tekimen)
- https://tekitoh-memdhoi.info
- https://github.com/youkidearitai
-----------------------------
