Hi,
First of all, thank you for the great Unicode library in C. Recently I've been
using the library intensively.
During my experiments with libunistring-0.9.5, I found what looks like a bug
in u-strtok.h, shown below. The lines starting with "////" are my changes.
    /* Move past the token. */
    {
      UNIT *token_end = U_STRPBRK (str, delim);
      if (token_end)
        {
          /* NUL-terminate the token. */
          *token_end = 0;
          *ptr = token_end + 1;
          //// These lines should be something like below.
          //// *ptr = token_end + (sizeof(uint8_t) * u8_strmblen(token_end));
          //// *token_end = 0;
        }
      else
        *ptr = NULL;
    }
The original code resumes the next search at "token_end + 1", which assumes
that the matched delimiter occupies exactly one UNIT, without checking how many
units it actually takes. When the delimiter takes more than one UNIT, such as a
Japanese character in UTF-8, this assumption fails and the next search starts
from an invalid location in the middle of a Unicode character.
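
To illustrate the failure, here is a minimal standalone sketch in C. It is not
libunistring code, and the helper names are hypothetical. It assumes UTF-8
input and uses U+3001 (the Japanese comma), which is encoded as the three bytes
E3 80 81, so "delimiter start + 1" is a continuation byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: true if b is a UTF-8 continuation byte (10xxxxxx). */
static int
is_continuation (uint8_t b)
{
  return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: nonzero if resuming the search at delim_start + 1
   would land inside the character that starts at delim_start.
   Assumes delim_start points into a NUL-terminated UTF-8 string. */
int
resumes_mid_character (const uint8_t *delim_start)
{
  assert (delim_start != NULL);
  return delim_start[0] != 0 && is_continuation (delim_start[1]);
}
```

For the string "a" followed by U+3001 and "b", calling
resumes_mid_character on the delimiter's first byte reports that "+ 1" lands
inside the character, while calling it on the ASCII "a" does not.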
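To solve the issue, one can define U_STRMBLEN as u8_strmblen, u16_strmblen or
u32_strmblen accordingly and use *ptr = token_end + U_STRMBLEN (token_end)
instead of *ptr = token_end + 1. The length has to be taken before *token_end
is set to 0, because the function returns 0 for a NUL character. Also, since
token_end is a UNIT pointer, the addition already counts in UNITs, so the
sizeof factor in my change above is harmless only for uint8_t and should be
dropped for the generic version.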
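
The idea can be sketched as a small standalone UTF-8 tokenizer. This is only an
illustration under stated assumptions, not the library's implementation:
char_units is a hypothetical stand-in for u8_strmblen that does no validation,
sketch_u8_strtok is a hypothetical name, and the input is assumed to be
well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for u8_strmblen: number of units in the character
   at s, or 0 at the terminating NUL. Assumes well-formed UTF-8. */
static size_t
char_units (const uint8_t *s)
{
  if (*s == 0)
    return 0;
  if (*s < 0x80)
    return 1;
  if ((*s & 0xE0) == 0xC0)
    return 2;
  if ((*s & 0xF0) == 0xE0)
    return 3;
  return 4;
}

/* True if the n-unit character at s occurs in the delimiter string. */
static int
char_in_set (const uint8_t *s, size_t n, const uint8_t *delim)
{
  while (*delim)
    {
      size_t m = char_units (delim);
      if (m == n && memcmp (s, delim, n) == 0)
        return 1;
      delim += m;
    }
  return 0;
}

/* strtok_r-style sketch that steps over a whole delimiter character. */
uint8_t *
sketch_u8_strtok (uint8_t *str, const uint8_t *delim, uint8_t **ptr)
{
  uint8_t *end;

  assert (delim != NULL && ptr != NULL);
  if (str == NULL && (str = *ptr) == NULL)
    return NULL;

  /* Skip leading delimiters, one whole character at a time. */
  while (*str && char_in_set (str, char_units (str), delim))
    str += char_units (str);
  if (*str == 0)
    {
      *ptr = NULL;
      return NULL;
    }

  /* Find the first delimiter character after the token. */
  end = str;
  while (*end && !char_in_set (end, char_units (end), delim))
    end += char_units (end);

  if (*end)
    {
      /* Measure the delimiter before overwriting its first unit. */
      size_t n = char_units (end);
      *end = 0;
      *ptr = end + n;
    }
  else
    *ptr = NULL;
  return str;
}
```

With a buffer holding "ab", U+3001, "cd" and a delimiter string holding only
U+3001, the first call returns "ab" and the next call with str == NULL
returns "cd", because the saved pointer is placed after all three bytes of the
delimiter rather than one byte into it.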
I've checked the source code and git log but could not find a relevant
change. However, if I've missed the change or misunderstood the logic, and the
code works as intended as it is, please disregard this email.
Thank you once again for your great work.
Seiya
Seiya Kawashima
Intermediate Application Programmer | Department of Radiology
The University of Chicago Biological Science
5841 S. Maryland Ave. | Rm. IB-012 | Chicago, IL 60637
Office: 773-834-1791