On 2011-07-13 09:00, Steven Schveighoffer wrote:
> On Wed, 13 Jul 2011 10:59:27 -0400, Regan Heath wrote:
> > I am suggesting the compiler will perform a special operation on all
> > char* parameters passed to extern "C" functions.
> >
> > The operation is a toStringz like operation which is (more or less) as
> > follows:
> >
> > 1. If there is a \0 character inside foo[0..$], do nothing.
>
> This is an O(n) operation -- too much overhead. Especially if you already
> know foo has a 0 in it. Note that toStringz does not have this overhead.
>
> > 2. If the array allocated memory is > the array length, place a \0 at
> > foo[$]
>
> The check to see if the array has allocated length requires a GC lock, and
> O(lgn) search for the block info in the GC.
>
> Not that it doesn't already happen in toStringz, but I just want to point
> out that it's not a small cost.
>
> > 3. Reallocate the array memory, updating foo, place a \0 at foo[$]
> > 4. Call the C function passing foo.ptr
> >
> > So, it will handle all the following cases:
> >
> > char[] foo;
> > .. code to populate foo ..
> >
> > ucase(foo);
> > ucase(foo.ptr);
>
> I read in your responses below, this is due to you making this equivalent
> to ucase(foo)? This still has the same problems I listed above.
>
> What about
>
> char * foo;
> .. code to populate foo ..
> ucase(foo);
>
> Is there still anything special done by the compiler?
>
> > ucase(toStringz(foo));
> >
> > The problem cases are the buffer cases I mentioned earlier, and they
> > wouldn't be a problem if char was initialised to \0 as I first imagined.
>
> The largest problem I've had with all this is there is a necessary
> overhead of conversion. Not only that, but due to the way reallocation
> works, there may be a move of data. I think it's better to require
> explicit calls incurring such overhead vs. hiding the overhead calls from
> the developer. Especially if the overhead calls are unnecessary.
>
> > Other replies inline below..
> >
> > On Tue, 12 Jul 2011 18:28:56 +0100, Steven Schveighoffer wrote:
> >> On Tue, 12 Jul 2011 13:00:41 -0400, Regan Heath wrote:
> >>> Replace foo with foo.ptr, it makes no difference to the point I was
> >>> making.
> >>
> >> Your fix does not help in that case, foo.ptr will be passed as a
> >> non-null terminated string.
> >
> > No, see above.
>
> How does your proposal know that a char * is part of a heap-allocated
> array? If you are assuming the only case where char * is passed will be
> arr.ptr, then that doesn't cut it. What if the compiler doesn't know
> where the char * came from?
>
> The inherent problem of zero-terminated strings is that you don't know how
> long it is until you search for a zero. If it's not properly terminated,
> then you are screwed. That problem cannot be "solved", even with compiler
> help -- you can get situations where there is no more information other
> than the pointer.
>
> >> So, your proposal fixes the case:
> >>
> >> 1. The user tries to pass a string/char[] to a C function. Fails to
> >>    compile.
> >> 2. Instead of trying to understand the issue, realizes the .ptr member
> >>    is the right type, and switches to that.
> >>
> >> It does not fix or help with cases where:
> >>
> >> * a programmer notices the type of the parameter is char * and uses
> >>   foo.ptr without trying foo first. (crash)
> >> * a programmer calls toStringz without going through the compile/fix
> >>   cycle above.
> >> * a programmer tries to pass string/char[], fails to compile, then
> >>   looks up how to interface with C and finds toStringz
> >>
> >> I think this fix really doesn't solve a very common problem.
> >
> > See above, my intention was to solve all the cases listed here as I
> > suspect the compiler can detect them all, and just 'do the right thing'.
> >
> > In these cases..
> >
> > 1. If the programmer writes foo.ptr, the compiler detects that, calls
> > toStringz on 'foo' (not foo.ptr) and updates foo as required (if
> > reallocation occurs).
>
> What if it's not foo.ptr? What if it's some random char * whose origin
> the compiler isn't aware of?
>
> > 2. If the programmer calls toStringz, this case is the same as #1 as
> > toStringz returns foo.ptr (I assume).
>
> Huh? Why should it do anything with toStringz? I'm not getting this one,
> toStringz already has done the work your proposal wants to do.
>
> >>> This is not a 'new' problem introduced by the idea, it's a general
> >>> problem for D/arrays/slices and the same happens with an append,
> >>> right? In which case it's not a reason against the idea.
> >>
> >> It's new to the features of the C function being called. If you look
> >> up the man page for such a hypothetical function, it might claim that
> >> it alters the data passed in through the argument, but it seems to not
> >> be the case! So there's no way for someone (who arguably is not well
> >> versed in C functions if they didn't know to use toStringz) to figure
> >> out why the code seems not to do what it says it should. Such a
> >> programmer may blame either the implementation of the C function, or
> >> blame the D compiler for not calling the function properly.
> >
> > None of this is relevant, let me explain..
> >
> > My idea is for the compiler to detect a char* parameter to an extern "C"
> > function and to call toStringz. When it does so it will correctly
> > update the slice/array being passed if reallocation occurs. The C
> > function will write to the slice/array being passed. So, it's not
> > relevant if there was another slice referencing the array before it was
> > reallocated, because that case is no different to calling a D function
> > which does something similar, like appending to the passed slice/array.
>
> What about this case?
>
> char buffer[12];
> buffer[] = "hello, world";
>
> ucase(buffer[]); // does nothing to buffer!
>
> I'm saying, the charter of the function is to update a string in place,
> and your proposal is making that not true in some cases.
>
> > The goal is to make a call to an extern "C" function "just work" in the
> > same way as calling Win32/C functions "just work" from C# .. which also
> > has its own string type.
>
> This is very different. C#'s strings are full reference types, so adding
> a '0' at the end affects all references to that string, reallocation or
> not.
>
> >> toStringz does not currently check for '\0' anywhere in the existing
> >> string. It simply appends '\0' to the end of the passed string. If
> >> you want it to check for '\0', how far should it go? Doesn't this also
> >> add to the overhead (looping over all chars looking for '\0')?
> >>
> >> Note also, that toStringz has old code that used to check for "one byte
> >> beyond" the array, but this is commented out, because it's unreliable
> >> (could cause a segfault).
> >
> > So, toStringz is not as clever as I imagined. I thought it would
> > intelligently detect cases where a \0 was already present in the slice
> > (from 0 to $) and if not, put one at $+1 (inside pre-allocated array
> > memory). I was assuming toStringz had access to the underlying array
> > allocation size and would know how far it can 'look' without causing a
> > segfault. In the case where the slice length equaled the array reserved
> > memory area, it would re-allocate and place the \0 at $+1 (inside the
> > newly allocated memory).
>
> s/clever/slow/
>
> The only "intelligent" way to check for a 0 is a linear search.
>
> Without knowing where the data came from, there is no way to look past the
> slice without possibly causing a segfault. If you know it's a heap
> allocation, you can look at the block information to see if you can look
> past it. This might be possible to do for toStringz, but the linear check
> for 0 is just unacceptable for a simple function call. Appending a 0 is
> at least amortized. One thing though, it could make some smarter
> decisions as to whether to reallocate depending on the type of the array,
> since it is already doing a lookup of block info.
>
> But I still always come back to the fact that I should be able to
> circumvent some auto-intelligent decision that isn't aware of things that
> a developer can be aware of (such as knowing an array already contains a
> 0). The compiler shouldn't be too intrusive here.
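
To make the trade-off in the quoted discussion concrete, here is a minimal sketch of the explicit approach argued for above. It is an illustration under stated assumptions, not anyone's actual proposal: `ucase` is a hypothetical stand-in for the thread's C function (written in D with C linkage so the example is self-contained), and `zeroTerminatedCopy` is a helper invented for this example. Only `toStringz`, `strlen` and `toupper` are real library functions.

```d
import core.stdc.ctype : toupper;
import core.stdc.string : strlen;
import std.string : toStringz;

// Hypothetical stand-in for the thread's `ucase`: upper-cases a
// zero-terminated string in place.
extern (C) void ucase(char* s)
{
    for (; *s != '\0'; ++s)
        *s = cast(char) toupper(*s);
}

// Hypothetical helper: copy the slice and append '\0' explicitly, so the
// allocation and copy are visible at the call site.
char[] zeroTerminatedCopy(const(char)[] s)
{
    auto buf = new char[s.length + 1];
    buf[0 .. s.length] = s[];
    buf[s.length] = '\0';
    return buf;
}

void main()
{
    // Case 1: the caller already knows the buffer holds a terminator,
    // so no scan, no GC lookup and no reallocation are needed.
    char[13] fixed = "hello, world\0";
    ucase(fixed.ptr);
    assert(fixed[0 .. 12] == "HELLO, WORLD");

    // Case 2: a slice with no terminator. The copy is explicit, and the
    // C function modifies the copy, not the original slice.
    char[] dyn = "hello".dup;
    char[] z = zeroTerminatedCopy(dyn);
    ucase(z.ptr);
    assert(z[0 .. $ - 1] == "HELLO");
    assert(dyn == "hello");

    // Case 3: read-only use. toStringz returns immutable(char)*, which
    // suits C functions that only read the string.
    assert(strlen(toStringz("hello")) == 5);
}
```

Case 2 is the situation the `char buffer[12]` example points at: once a copy or reallocation is involved, an "in place" C function no longer touches the caller's data, which is easier to spot when the copy is written out by hand.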
Andrej Mitrovic found a rather annoying issue with toStringz and toUTFz (fortunately one that is very unlikely to come up in practice): under some circumstances, both functions check for a terminating '\0' one past the end of the string. You might want to have a look at it:

https://github.com/D-Programming-Language/phobos/pull/123

Given what you know about the GC and arrays, your thoughts on the matter would be welcome.

- Jonathan M Davis
