On 7/24/11 7:41 PM, Johann MacDonagh wrote:
Both toStringz and toUTFz do something potentially unsafe. Both check
whether the character after the end of the string is '\0'. If so, they
simply return a pointer to the original string. This is a good
optimization in theory because this code:

string s = "abc";

will be a slice of a read-only section of the executable. The compiler
will insert a '\0' after the string in the read-only section. So this:

auto x = toStringz("abc");

is efficient. No relocations.
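
To make the check concrete, here is a minimal sketch of the peek-after-end
idea described above. It is not the Phobos source: toStringzSketch is a
hypothetical name, and reading one byte past the slice unconditionally is a
simplification made only for illustration.

```d
// Hypothetical illustration of the "peek after end" check; not Phobos code.
immutable(char)* toStringzSketch(string s)
{
    // Look at the byte just past the end of the slice. For a string
    // literal the compiler has placed a '\0' there, so the original
    // storage can be reused and nothing is allocated.
    if (s.length != 0 && *(s.ptr + s.length) == '\0')
        return s.ptr;

    // Otherwise allocate a copy with room for the terminator.
    auto copy = new char[s.length + 1];
    copy[0 .. s.length] = s[];
    copy[s.length] = '\0';
    return cast(immutable(char)*) copy.ptr;
}
```

The first branch is the one in question in the rest of this message: it
trusts a byte that the slice does not own.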

As @AndrejMitrovic commented in Phobos pull request 123
https://github.com/D-Programming-Language/phobos/pull/123, this has
potential issues:

import std.string;
import std.stdio;

struct A
{
immutable char[2] foo;
char[2] bar;
}

void main()
{
auto a = A("aa", "\0b");
auto charptr = toStringz(a.foo[]);

a.bar = "bo";
printf(charptr); // two chars, then garbage
}

Another issue not mentioned is with slices. If I do...

string s = "abc";
string y = s[];
string z = y[];

z ~= '\0';

auto c = toStringz(y);

assert(c == y.ptr);

... what happens if I change that last character of z before I pass c to
the C routine? Bad news. I think this optimization is great, but doesn't
it go against D's motto of doing "the right thing by default"?

The question is, how can we keep this optimization so that:

toStringz("abc");

remains efficient?

The capacity field is 0 if the string is in a read-only section *or* if
the string is on the stack:

auto x = "abc";
assert(x.capacity == 0);
char[3] y = "abc";
assert(y[].capacity == 0);

So, this isn't safe either. This code:

char[3] x = "abc";
char y = '\0';

will put y right after x, so changing y after calling toStringz will
cause issues.

In reality, the only time it's safe to do the "peek after end" is if the
string is in the read-only section. Otherwise, there are potential
issues (even if they are edge cases).

Do we care about this? Is there something we can add to druntime arrays
that will tell whether or not the data backing a slice is in read-only
memory (or perhaps an enum: read-only, stack, heap, other)? In reality,
the only time this changes is when a read-only or stack slice is appended
to, so performance issues are minimal.
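
As a rough sketch of the enum idea above: druntime exposes no such
information today, so the Backing enum and the toStringzGated function below
are purely hypothetical, and the caller has to supply the classification by
hand. The point is only that the peek would be taken when the data is known
to be read-only and skipped otherwise.

```d
// Hypothetical classification of the memory behind a slice.
// Nothing like this is provided by druntime; it is assumed for illustration.
enum Backing { readOnly, stack, heap, other }

// Hypothetical variant of toStringz that peeks past the end only when the
// caller says the data lives in read-only memory.
immutable(char)* toStringzGated(string s, Backing where)
{
    if (where == Backing.readOnly && s.length != 0 && s.ptr[s.length] == '\0')
        return s.ptr; // terminator cannot change, reuse the storage

    // For stack, heap, or unknown memory, always make a terminated copy.
    return (s ~ '\0').ptr;
}
```

With this shape, toStringzGated("abc", Backing.readOnly) would keep the
no-allocation path for literals, while every other case pays for a copy.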

Comments? Ideas?

I'm not too worried. I think it is fair to guarantee that the pointer returned from toStringz points to a zero-terminated string only up until the first relevant change. It is difficult to define what a relevant change is, but in practice I think it is understood what's going on. If there's a need for a persistent stringz, creating a private copy immediately after the call to toStringz is always an option.
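
For illustration, a private copy could be taken along these lines. This is a minimal sketch, not a proposed Phobos addition: privateStringz is a hypothetical helper name, and it assumes the input has no embedded '\0', since strlen stops at the first one.

```d
import std.string : toStringz;
import core.stdc.string : strlen;

// Hypothetical helper: returns a zero-terminated copy that shares no
// storage with the input, so later changes near the original cannot
// affect it.
char* privateStringz(string s)
{
    auto p = toStringz(s);                  // may alias s
    auto copy = p[0 .. strlen(p) + 1].dup;  // copy including the terminator
    return copy.ptr;
}
```

The copy costs one allocation per call, which is the price of not depending on what sits after the original string.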


Andrei
