> On Dec 13, 2025, at 07:24, Jeff Davis <[email protected]> wrote:
>
> Attached.
>
>
> <v1-0001-Make-utf8_to_unicode-safer.patch>
This patch adds a length check to utf8_to_unicode(), I think which is where
“safety” comes from. Can you please add an a little bit more to the commit
message instead of only saying “improve safety”. It also deleted two redundant
function declarations from pg_wchar.h, which may also worth a quick note in the
commit message.
The code changes all look good to me. Only nitpicks are:
1
```
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c
b/contrib/fuzzystrmatch/daitch_mokotoff.c
index 07f895ae2bf..47bd2814460 100644
--- a/contrib/fuzzystrmatch/daitch_mokotoff.c
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -401,7 +401,8 @@ read_char(const unsigned char *str, int *ix)
/* Decode UTF-8 character to ISO 10646 code point. */
str += *ix;
- c = utf8_to_unicode(str);
+ /* Assume byte sequence has not been broken. */
+ c = utf8_to_unicode(str, MAX_MULTIBYTE_CHAR_LEN);
```
Here we need an empty line above the new comment.
2
```
diff --git a/src/common/wchar.c b/src/common/wchar.c
index a4bc29921de..c113cadf815 100644
--- a/src/common/wchar.c
+++ b/src/common/wchar.c
@@ -661,7 +661,8 @@ ucs_wcwidth(pg_wchar ucs)
static int
pg_utf_dsplen(const unsigned char *s)
{
- return ucs_wcwidth(utf8_to_unicode(s));
+ /* trust that input is not a truncated byte sequence */
+ return ucs_wcwidth(utf8_to_unicode(s, MAX_MULTIBYTE_CHAR_LEN));
}
```
For the new comment, as a code reader, I wonder why we “trust” that? To me, it
more feels like because of lacking length information, we have to trust. I
would like this comment to be enhanced a little bit with more information.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/