Re: [PHP-DEV] [Discussion] grapheme cluster for str_split function

youkidearitai Tue, 05 Mar 2024 00:26:18 -0800

>
> Hi, Niels
>
> Thank you for your comment.
> Indeed, returning false makes sense.
>
> Therefore, I changed it to return false for invalid UTF-8 strings.
>
> Regards
> Yuya
>
> --
> ---------------------------
> Yuya Hamada (tekimen)
> - https://tekitoh-memdhoi.info
> - https://github.com/youkidearitai
> -----------------------------


Sorry to write again.
I checked the behavior of the mb_str_split function. Illegal byte sequences
are returned as is.

```
sapi/cli/php -r 'var_dump(mb_str_split("あ\xc2\xf4\x80あ"));'
array(4) {
  [0]=>
  string(3) "あ"
  [1]=>
  string(2) "��"
  [2]=>
  string(1) "�"
  [3]=>
  string(3) "あ"
}
```
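
The invalid bytes show up only as replacement glyphs in the dump above, so a small sketch that hex-encodes each element may make the result easier to read. It uses the same input as the command above. Passing 'UTF-8' explicitly is an assumption here; the command above relies on the default internal encoding.

```php
<?php
// Same input as the command above: "あ" (U+3042), three bytes that are
// not valid UTF-8, then "あ" again.
$input = "\u{3042}\xc2\xf4\x80\u{3042}";

// Encoding is passed explicitly (an assumption; the original command
// relies on the default internal encoding).
$parts = mb_str_split($input, 1, 'UTF-8');

// Hex-encode each element so the raw bytes are visible.
$hex = array_map('bin2hex', $parts);

var_dump($hex);
```

The two middle elements should come out as the bytes c2f4 and 80, which matches the string(2) and string(1) entries in the dump above.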

I also read the ICU documentation for utext_openUTF8 (link below):
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/utext_8h.html#a130e7cba201c4b38799b432eb269f6d5

> Any invalid UTF-8 in the input will be handled in this way: a sequence of 
> bytes that has the form of a truncated, but otherwise valid, UTF-8 sequence 
> will be replaced by a single unicode replacement character, \uFFFD. Any other 
> illegal bytes will each be replaced by a \uFFFD.

Therefore, I think the encoding check is not needed.
The function will always return an array, the same as mb_str_split.
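
For callers who still want invalid input rejected, the check could be done in userland before splitting. The following is only a minimal sketch: split_or_false is a hypothetical helper name, not part of the proposal, and mb_str_split stands in for the splitting function (it splits by code point, not by grapheme cluster).

```php
<?php
// Hypothetical helper, not part of the proposal: reject invalid UTF-8
// up front, otherwise split. mb_str_split is only a stand-in here and
// splits by code point, not by grapheme cluster.
function split_or_false(string $input): array|false
{
    if (!mb_check_encoding($input, 'UTF-8')) {
        return false;
    }
    return mb_str_split($input, 1, 'UTF-8');
}

// Invalid byte sequence in the middle: the helper returns false.
var_dump(split_or_false("\u{3042}\xc2\xf4\x80\u{3042}"));

// Valid UTF-8: the helper returns an array of two characters.
var_dump(split_or_false("\u{3042}\u{3042}"));
```

This keeps the strict behavior available to callers who need it, while the function itself stays consistent with mb_str_split.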

Regards
Yuya


-- 
---------------------------
Yuya Hamada (tekimen)
- https://tekitoh-memdhoi.info
- https://github.com/youkidearitai
-----------------------------
