[bug-libunistring] A bug in u-strtok.h and the fix in libunistring-0.9.5

Seiya Kawashima Thu, 02 Jul 2015 13:47:18 -0700

Hi,

First of all, thank you for the great Unicode library in C. Recently I've been 
using the library intensively.


During my experiments with libunistring-0.9.5, I found what looks like a bug in 
u-strtok.h, shown below. The lines starting with "////" are my suggested changes.

  /* Move past the token. */
  {
    UNIT *token_end = U_STRPBRK (str, delim);
    if (token_end)
      {
        /* NUL-terminate the token.  */
        *token_end = 0;
        *ptr = token_end + 1;
        //// These lines should be something like the following: measure
        //// the delimiter before overwriting its first unit with NUL.
        //// *ptr = token_end + u8_strmblen (token_end);
        //// *token_end = 0;
      }
    else
      *ptr = NULL;
  }

The original code resumes the search at "token_end + 1", which assumes that 
the matched delimiter occupies exactly one UNIT, without checking its actual 
length. When the delimiter takes more than one UNIT, as a Japanese delimiter 
does in UTF-8, this assumption fails and the next search starts from an 
invalid location in the middle of a Unicode character.
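
A minimal sketch of the failure for the UTF-8 case, assuming valid input. It 
does not use libunistring; is_utf8_continuation and resumes_mid_character are 
hypothetical helpers introduced only for this illustration. The ideographic 
comma U+3001 is encoded as the three bytes E3 80 81, so one unit past its 
start is a continuation byte rather than the start of a character:

```c
#include <stddef.h>

/* Hypothetical helper: nonzero if b is a UTF-8 continuation byte,
   i.e. has the bit pattern 10xxxxxx.  */
static int
is_utf8_continuation (unsigned char b)
{
  return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: nonzero if resuming exactly one unit after the
   delimiter that starts at delim_pos would land inside a multi-unit
   character.  Assumes delim_pos points at the first byte of a valid,
   non-NUL UTF-8 character.  */
static int
resumes_mid_character (const unsigned char *delim_pos)
{
  return is_utf8_continuation (delim_pos[1]);
}
```

For the string "a", U+3001, "b", calling resumes_mid_character on the 
delimiter position returns 1; for "a,b" with an ASCII comma it returns 0.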

To solve the issue, one can define U_STRMBLEN as u8_strmblen, u16_strmblen or 
u32_strmblen accordingly and call it like *ptr = token_end + 
U_STRMBLEN (token_end) instead of *ptr = token_end + 1. No sizeof(UNIT) factor 
is needed, because pointer arithmetic on UNIT * already counts in UNITs.
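
The proposed change can be sketched as follows for UTF-8, under simplifying 
assumptions: the input is valid UTF-8, there is a single non-empty delimiter 
character passed as a NUL-terminated string, and the standard strstr stands in 
for U_STRPBRK. utf8_len and split_once are hypothetical names, not part of 
libunistring; utf8_len only plays the role that u8_strmblen would play.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for u8_strmblen: length in bytes of the UTF-8
   character starting at s, derived from its lead byte.  Assumes valid
   UTF-8 and that s points at a character boundary; no validation.  */
static size_t
utf8_len (const unsigned char *s)
{
  if (*s == 0)
    return 0;
  if (*s < 0x80)
    return 1;
  if ((*s & 0xE0) == 0xC0)
    return 2;
  if ((*s & 0xF0) == 0xE0)
    return 3;
  return 4;
}

/* Hypothetical single-split helper: cuts str at the first occurrence of
   the character delim_char, returns the token, and stores the start of
   the remainder in *ptr (NULL if no delimiter was found).  */
static char *
split_once (char *str, const char *delim_char, char **ptr)
{
  char *token_end = strstr (str, delim_char);
  if (token_end)
    {
      /* Measure the delimiter first: after NUL-termination its length
         would read as 0.  */
      size_t len = utf8_len ((const unsigned char *) token_end);
      *token_end = 0;
      /* Skip the whole delimiter, not just one unit.  */
      *ptr = token_end + len;
    }
  else
    *ptr = NULL;
  return str;
}
```

Splitting "ab", U+3001, "cd" on U+3001 this way yields the token "ab" and a 
remainder that starts at "cd", on a character boundary.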

I've checked the source code and the git log but could not find a relevant 
change. However, if I've missed the change or misunderstood the logic, and the 
code works as intended as is, please disregard this email.

Thank you once again for your great work.
Seiya

Seiya Kawashima
Intermediate Application Programmer | Department of Radiology

The University of Chicago Biological Science
5841 S. Maryland Ave. | Rm. IB-012 | Chicago, IL 60637
Office: 773-834-1791
