Re: Small patch to improve safety of utf8_to_unicode().

Chao Li Sat, 13 Dec 2025 15:23:30 -0800

> On Dec 13, 2025, at 07:24, Jeff Davis <[email protected]> wrote:
> 
> Attached.
> 
> 
> <v1-0001-Make-utf8_to_unicode-safer.patch>


This patch adds a length check to utf8_to_unicode(), which I think is where the 
“safety” comes from. Could you please add a little more detail to the commit 
message instead of only saying “improve safety”? It also deletes two redundant 
function declarations from pg_wchar.h, which may also be worth a quick note in the 
commit message.
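
To illustrate what I mean by a length check, here is a minimal standalone 
sketch of a length-aware UTF-8 decoder. It is not the code from the patch: the 
names sketch_utf8_to_unicode() and SKETCH_INVALID_CODEPOINT are hypothetical, 
and returning a sentinel value on truncated input is only my assumption about 
one reasonable behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel for "could not decode"; not a PostgreSQL symbol. */
#define SKETCH_INVALID_CODEPOINT 0xFFFFFFFFu

/*
 * Decode one UTF-8 sequence starting at s, reading at most len bytes.
 * Returns the code point, or SKETCH_INVALID_CODEPOINT if the lead byte
 * is not valid or the sequence would extend past len bytes.
 * Continuation bytes are not validated in this sketch.
 */
static uint32_t
sketch_utf8_to_unicode(const unsigned char *s, size_t len)
{
	assert(s != NULL);

	if (len == 0)
		return SKETCH_INVALID_CODEPOINT;

	if ((s[0] & 0x80) == 0)
	{
		/* 1-byte sequence (ASCII) */
		return (uint32_t) s[0];
	}
	else if ((s[0] & 0xe0) == 0xc0)
	{
		/* 2-byte sequence: refuse to read past the caller's length */
		if (len < 2)
			return SKETCH_INVALID_CODEPOINT;
		return ((uint32_t) (s[0] & 0x1f) << 6) |
			(uint32_t) (s[1] & 0x3f);
	}
	else if ((s[0] & 0xf0) == 0xe0)
	{
		/* 3-byte sequence */
		if (len < 3)
			return SKETCH_INVALID_CODEPOINT;
		return ((uint32_t) (s[0] & 0x0f) << 12) |
			((uint32_t) (s[1] & 0x3f) << 6) |
			(uint32_t) (s[2] & 0x3f);
	}
	else if ((s[0] & 0xf8) == 0xf0)
	{
		/* 4-byte sequence */
		if (len < 4)
			return SKETCH_INVALID_CODEPOINT;
		return ((uint32_t) (s[0] & 0x07) << 18) |
			((uint32_t) (s[1] & 0x3f) << 12) |
			((uint32_t) (s[2] & 0x3f) << 6) |
			(uint32_t) (s[3] & 0x3f);
	}

	/* Not a valid lead byte */
	return SKETCH_INVALID_CODEPOINT;
}
```

With a decoder of this shape, a caller that knows how many bytes remain can 
pass that count and get the sentinel back for a truncated sequence, while a 
caller that passes the maximum character length is effectively asserting that 
enough bytes are available.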

The code changes all look good to me. My only nitpicks are:

1
```
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
index 07f895ae2bf..47bd2814460 100644
--- a/contrib/fuzzystrmatch/daitch_mokotoff.c
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -401,7 +401,8 @@ read_char(const unsigned char *str, int *ix)
 
        /* Decode UTF-8 character to ISO 10646 code point. */
        str += *ix;
-       c = utf8_to_unicode(str);
+       /* Assume byte sequence has not been broken. */
+       c = utf8_to_unicode(str, MAX_MULTIBYTE_CHAR_LEN);
```

Here we need an empty line above the new comment.

2
```
diff --git a/src/common/wchar.c b/src/common/wchar.c
index a4bc29921de..c113cadf815 100644
--- a/src/common/wchar.c
+++ b/src/common/wchar.c
@@ -661,7 +661,8 @@ ucs_wcwidth(pg_wchar ucs)
 static int
 pg_utf_dsplen(const unsigned char *s)
 {
-       return ucs_wcwidth(utf8_to_unicode(s));
+       /* trust that input is not a truncated byte sequence */
+       return ucs_wcwidth(utf8_to_unicode(s, MAX_MULTIBYTE_CHAR_LEN));
 }
```

For the new comment, as a code reader, I wonder why we “trust” that. To me, it 
feels more like we have to trust the input because no length information is 
available here. I would like this comment to explain that in a little more detail.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/