Re: Using libutf8proc for svn_utf_cstring_utf8_width()

Evgeny Kotkov via dev Tue, 19 May 2026 12:42:50 -0700

Timofei Zhakov <[email protected]> writes:

> Committed the test in r1934405 and the rest of the patch in r1934407.
> Removed svn_utf__cstring_is_valid() check in r1934408.


Thanks for the change!  I like the idea, and the implementation looks
good to me overall.

I'm a bit late to the party, but here are a few minor comments and
suggestions regarding the code:

+  /* Convert the UTF-8 string to UTF-32 (UCS4) which is the format
+   * utf8proc_charwidth() expects, and get the width of each character.
+   * We don't need much error checking since the input is valid UTF-8. */
+  while (*cstr)

After r1934408, the comment stating that "the input is valid UTF-8" seems
outdated, because we no longer check for valid UTF-8 in advance.

+      int nbytes = utf8proc_iterate((apr_byte_t*)cstr, -1, &ucs);

Since `cstr` is a `const char *`, I think this cast unnecessarily discards
constness.

Also, because utf8proc_iterate() expects a `const utf8proc_uint8_t *`,
I'd say that casting directly to that target type is safer than using
`apr_byte_t *`, because the latter could theoretically denote a
different underlying type.

A couple of similar observations:

- utf8proc_iterate() returns a utf8proc_ssize_t.  Assigning this to
  `int nbytes` risks truncation in environments where int is smaller
  than utf8proc_ssize_t:

+      int nbytes = utf8proc_iterate((apr_byte_t*)cstr, -1, &ucs);

- utf8proc_iterate() puts the codepoint result in a utf8proc_int32_t,
  whereas we currently type it as an apr_int32_t:

+      apr_int32_t ucs;

  While those types are almost certainly compatible, I think that changing
  it to `utf8proc_int32_t` would guarantee that we use a correct type.
  It would also ensure that `utf8proc_charwidth(utf8proc_int32_t c)` gets
  called with an argument of the exactly expected type.


Thanks,
Evgeny Kotkov

Re: Using libutf8proc for svn_utf_cstring_utf8_width()

Reply via email to