https://issues.dlang.org/show_bug.cgi?id=16090
Issue ID: 16090
Summary: popFront generates out-of-bounds array index on
corrupted utf-8 strings
Product: D
Version: D2
Hardware: x86
OS: Mac OS X
Status: NEW
Severity: normal
Priority: P1
Component: phobos
Assignee: [email protected]
Reporter: [email protected]
If a UTF-8 string is truncated in the middle of a multi-byte UTF-8 character,
popFront generates an out-of-bounds array index. When compiled with
-boundscheck=on, popFront throws a core.exception.RangeError. With
-boundscheck=off, the behavior is undefined: in my tests of the program below,
the while loop kept running until it triggered a bus error.
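void main(string[] args) {
    import std.stdio;
    import std.range;
    auto s = "aä";
    // Drop the final byte, cutting the two-byte 'ä' in half.
    auto corrupted = s[0 .. $-1];
    auto n = 0;
    while (!corrupted.empty) {
        // Fails on the second iteration, at the truncated 'ä'.
        corrupted.popFront;
        n++;
    }
    writeln(n);
}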
In this program, the 'ä' character is a two-byte UTF-8 sequence. Dropping the
last byte leaves an incomplete UTF-8 code point.
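To make the truncation concrete, the following sketch prints the raw bytes of
both strings. It assumes the source file is saved as UTF-8, so that 'ä'
(U+00E4) is encoded as the two bytes C3 A4.

```d
import std.stdio : writefln;

void main()
{
    string s = "aä";

    // Reinterpret the string as its raw UTF-8 code units (bytes).
    auto whole = cast(const(ubyte)[]) s;
    writefln("%(%02X %)", whole);      // 61 C3 A4

    // Same slice as in the report: the trailing byte A4 is gone,
    // leaving the lead byte C3 with no continuation byte.
    auto corrupted = cast(const(ubyte)[]) s[0 .. $ - 1];
    writefln("%(%02X %)", corrupted);  // 61 C3
}
```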
The reason this is so problematic is that string processing often involves
corrupted strings, in particular, strings read at run-time from input sources.
In the sample program above it can be said that this is a programmer error.
However, if the string is read from an outside source, the program needs to be
able to defend against corrupted strings.
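As a possible workaround for input from outside sources, a program can check a
string with std.utf.validate before iterating over it. The sketch below wraps
that call in a helper; the name isWellFormed is hypothetical and is not part of
Phobos. It costs an extra pass over the data.

```d
import std.stdio : writeln;
import std.utf : validate, UTFException;

/// Hypothetical helper: true if `s` is well-formed UTF-8.
bool isWellFormed(string s)
{
    try
    {
        validate(s);  // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    string s = "aä";
    writeln(isWellFormed(s));              // true
    writeln(isWellFormed(s[0 .. $ - 1]));  // false
}
```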
It appears this problem arises from this code in popFront (the isNarrowString
branch), currently line 2076 in std/range/primitives.d:
    import core.bitop : bsr;
    auto msbs = 7 - bsr(~c);
    if ((msbs < 2) | (msbs > 6))
    {
        //Invalid UTF-8
        msbs = 1;
    }
    str = str[msbs .. $];
The msbs variable holds the length of the UTF-8 code point as indicated by the
first byte. The 'str[msbs .. $]' expression assumes the string is long enough
to hold the full code point.
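One way to avoid the out-of-bounds slice would be to clamp the computed length
to the remaining string length. The following is only a sketch of that idea,
not the Phobos implementation: popFrontClamped is a hypothetical name, and the
sequence length is derived from a simplified lead-byte test rather than the
bsr expression quoted above.

```d
import std.algorithm.comparison : min;
import std.stdio : writeln;

/// Hypothetical stand-in for the narrow-string branch of popFront.
void popFrontClamped(ref string str)
{
    assert(str.length > 0, "cannot pop from an empty string");
    immutable uint c = str[0];

    // Sequence length as announced by the lead byte.
    size_t len;
    if (c < 0x80)                len = 1;  // ASCII
    else if ((c & 0xE0) == 0xC0) len = 2;
    else if ((c & 0xF0) == 0xE0) len = 3;
    else if ((c & 0xF8) == 0xF0) len = 4;
    else                         len = 1;  // invalid lead byte: skip it

    // Never slice past the end of a truncated string.
    str = str[min(len, str.length) .. $];
}

void main()
{
    string corrupted = "aä"[0 .. $ - 1];
    size_t n;
    while (corrupted.length > 0)
    {
        popFrontClamped(corrupted);
        ++n;
    }
    writeln(n);  // 2
}
```

With the clamp, the loop from the report terminates after two steps: one for
'a' and one for the orphaned lead byte.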
Besides being problematic for practical applications, this is inconsistent
with other auto-decoding behavior. The 'front' routine throws a
std.utf.UTFException in this situation, and popFront itself handles the case of
an invalid first byte differently, by simply moving past it.
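The behavior of 'front' on a truncated sequence can be checked with a short
test. This sketch assumes a Phobos version in which 'front' on a narrow string
decodes with the default, throwing, error handling.

```d
import std.exception : assertThrown;
import std.range.primitives : front;
import std.utf : UTFException;

void main()
{
    // A lone lead byte (0xC3) with its continuation byte removed.
    string truncated = "ä"[0 .. $ - 1];

    // Decoding the incomplete sequence is expected to throw.
    assertThrown!UTFException(truncated.front);
}
```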