Hi, Internals

I noticed that UAX #29 does not limit the number of code points in a grapheme cluster.
https://www.unicode.org/reports/tr29/

An ICU issue also confirmed that Unicode defines no such limit.
https://unicode-org.atlassian.net/browse/ICU-23302

This means a single grapheme cluster can contain an arbitrarily large
number of code points, which can crash a program because computer
resources are limited.

For example, the following command writes about 200 MB to emoji_bomb.txt, yet the whole file is a single grapheme cluster:
```
php -d memory_limit=600M -r 'echo(mb_trim(str_repeat("\u{200d}\u{1f468}\u{200d}\u{1f466}\u{200d}\u{1f466}", 10000000), "\u{200d}"));' > emoji_bomb.txt
```
(PLEASE BE CAREFUL WHEN OPENING emoji_bomb.txt, BECAUSE IT MAY CRASH THE PROGRAM THAT OPENS IT.)
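
A scaled-down sketch of the same construction may be safer to try. It assumes PHP 8.4+ (for mb_trim) with the mbstring and intl extensions loaded, and uses 1,000 repetitions instead of 10,000,000. With a reasonably recent ICU, it should report thousands of code points but only one grapheme cluster.

```php
<?php
// Scaled-down version of the command above: 1,000 repetitions
// instead of 10,000,000, so the string is only about 21 KB.
$unit = "\u{200d}\u{1f468}\u{200d}\u{1f466}\u{200d}\u{1f466}";
$str = mb_trim(str_repeat($unit, 1000), "\u{200d}");

// Code points are counted by mbstring, grapheme clusters by intl (ICU).
$bytes = strlen($str);
$codepoints = mb_strlen($str, 'UTF-8');
$graphemes = grapheme_strlen($str);

printf(
    "%d bytes, %d code points, %d grapheme cluster(s)\n",
    $bytes,
    $codepoints,
    $graphemes
);
```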

So I think we (php-src, at the programming language level) need a new
function that enforces a limit.
My idea is below:

```
grapheme_limit_codepoints(string $str, int $max_codepoints = 32): bool
```
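
To illustrate the intended behavior, here is a minimal userland sketch, not a proposed implementation. The name userland_grapheme_limit_codepoints is hypothetical, and the return semantics are my assumption: true when every grapheme cluster in the string has at most $max_codepoints code points, false otherwise. It assumes valid UTF-8 input and the intl and mbstring extensions.

```php
<?php
// Hypothetical userland approximation of the proposed function.
// Assumption: returns true when every grapheme cluster in $str has
// at most $max_codepoints code points, and false otherwise.
function userland_grapheme_limit_codepoints(string $str, int $max_codepoints = 32): bool
{
    // ICU character break iterator: splits text into grapheme clusters.
    $it = IntlBreakIterator::createCharacterInstance();
    $it->setText($str);

    // The parts iterator yields each grapheme cluster as a string.
    foreach ($it->getPartsIterator() as $cluster) {
        if (mb_strlen($cluster, 'UTF-8') > $max_codepoints) {
            return false;
        }
    }
    return true;
}

// A family emoji: 5 code points joined by ZWJ.
$family = "\u{1f468}\u{200d}\u{1f466}\u{200d}\u{1f466}";
var_dump(userland_grapheme_limit_codepoints($family));

// 50 "man + ZWJ" pairs followed by one "boy": 101 code points.
$long = str_repeat("\u{1f468}\u{200d}", 50) . "\u{1f466}";
var_dump(userland_grapheme_limit_codepoints($long));
var_dump(userland_grapheme_limit_codepoints($long, 128));
```

This sketch still builds each cluster as a string before counting, so it only shows the semantics. A native implementation could presumably stop scanning as soon as the limit is exceeded.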

I don't have a strong opinion on 32 as the default for $max_codepoints.
However, 32 code points should be enough for a grapheme cluster, because
the longest one in a human language is probably
Hakṣhmalawarayaṁ (ཧྐྵྨླྺྼྻྂ), at 9 code points.

If more code points per grapheme cluster are needed,
userland code can increase $max_codepoints.

Please also see my slides on Speaker Deck.
https://speakerdeck.com/youkidearitai/limit-of-code-point-for-grapheme-cluster

What do you think about this idea?

Regards
Yuya

-- 
---------------------------
Yuya Hamada (tekimen)
- https://tekitoh-memdhoi.info
- https://github.com/youkidearitai
-----------------------------
