Re: toStringz or not toStringz

Steven Schveighoffer Wed, 13 Jul 2011 09:05:52 -0700

On Wed, 13 Jul 2011 10:59:27 -0400, Regan Heath <[email protected]> wrote:

I am suggesting the compiler will perform a special operation on all char* parameters passed to extern "C" functions.
The operation is a toStringz like operation which is (more or less) as follows:
1. If there is a \0 character inside foo[0..$], do nothing.

This is an O(n) operation -- too much overhead. Especially if you already know foo has a 0 in it. Note that toStringz does not have this overhead.

2. If the array allocated memory is > the array length, place a \0 at foo[$]

The check to see if the array has allocated length requires a GC lock, and an O(lg n) search for the block info in the GC.

Not that it doesn't already happen in toStringz, but I just want to point out that it's not a small cost.

3. Reallocate the array memory, updating foo, place a \0 at foo[$]
4. Call the C function passing foo.ptr
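
To make the cost of steps 1 and 2 concrete, here is a minimal D sketch. `hasTerminator` is a hypothetical helper (not part of Phobos) that performs the linear scan from step 1. The built-in `capacity` property stands in for the step 2 check: it has to consult the GC's block information, and it reports 0 for memory the GC does not own, such as a slice of a stack array.

```d
import core.stdc.string : memchr;

/// Hypothetical helper (an assumption for illustration): step 1 of the
/// proposal, a scan of the whole slice for an embedded '\0'.
bool hasTerminator(const(char)[] s)
{
    // memchr may have to visit all s.length bytes, hence O(n).
    return s.length != 0 && memchr(s.ptr, 0, s.length) !is null;
}

void main()
{
    char[] heap = "hello".dup;          // GC-allocated, no embedded '\0'
    assert(!hasTerminator(heap));
    assert(hasTerminator("hi\0there"));

    // Step 2: capacity asks the GC how far the array may grow in place.
    assert(heap.capacity >= heap.length);

    // A slice of stack memory has no GC block, so its capacity is 0 and
    // any append must reallocate.
    char[12] stack = "hello, world";
    assert(stack[].capacity == 0);
}
```

Neither check is free, and both would run on every call if the compiler inserted them implicitly.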

So, it will handle all the following cases:

char[] foo;
.. code to populate foo ..

ucase(foo);
ucase(foo.ptr);

I read in your responses below, this is due to you making this equivalent to ucase(foo)? This still has the same problems I listed above.


What about

char * foo;
.. code to populate foo ..
ucase(foo);

Is there still anything special done by the compiler?

ucase(toStringz(foo));
The problem cases are the buffer cases I mentioned earlier, and they wouldn't be a problem if char was initialised to \0 as I first imagined.

The largest problem I've had with all this is there is a necessary overhead of conversion. Not only that, but due to the way reallocation works, there may be a move of data. I think it's better to require explicit calls incurring such overhead vs. hiding the overhead calls from the developer. Especially if the overhead calls are unnecessary.
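
As an illustration of keeping the conversion explicit, the sketch below calls toStringz only where the caller cannot vouch for a terminator, and skips it for a string literal, which D guarantees is followed by a '\0'. `lengthViaC` is a hypothetical wrapper used only for this example; `strlen` is the C library function as declared in `core.stdc.string`.

```d
import core.stdc.string : strlen;
import std.string : toStringz;

/// Hypothetical wrapper (an assumption for illustration): pay for the
/// conversion explicitly, then call the C function.
size_t lengthViaC(const(char)[] s)
{
    return strlen(toStringz(s));
}

void main()
{
    char[] built = "hello".dup;         // no terminator guaranteed
    assert(lengthViaC(built) == 5);

    // The developer knows a literal is already zero-terminated, so no
    // conversion (and no overhead) is needed here.
    assert(strlen("hello".ptr) == 5);
}
```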

Other replies inline below..
On Tue, 12 Jul 2011 18:28:56 +0100, Steven Schveighoffer <[email protected]> wrote:
On Tue, 12 Jul 2011 13:00:41 -0400, Regan Heath <[email protected]> wrote:
Replace foo with foo.ptr, it makes no difference to the point I was making.
Your fix does not help in that case, foo.ptr will be passed as a non-null-terminated string.
No, see above.

How does your proposal know that a char * is part of a heap-allocated array? If you are assuming the only case where char * is passed will be arr.ptr, then that doesn't cut it. What if the compiler doesn't know where the char * came from?

The inherent problem of zero-terminated strings is that you don't know how long it is until you search for a zero. If it's not properly terminated, then you are screwed. That problem cannot be "solved", even with compiler help -- you can get situations where there is no more information other than the pointer.
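
A small sketch of that point, assuming a slice that happens to sit inside a larger, properly terminated array: the slice's length is dropped as soon as only `.ptr` is passed, so the callee recovers a length purely by searching for the first zero. `lengthSeenByC` is a hypothetical helper introduced for this example.

```d
import core.stdc.string : strlen;

/// Hypothetical helper (an assumption for illustration): what a C callee
/// can learn from the pointer alone.
/// Precondition (unchecked): a '\0' exists somewhere at or after s.ptr.
size_t lengthSeenByC(const(char)[] s)
{
    return strlen(s.ptr);   // the slice's length is not passed along
}

void main()
{
    char[] whole = "hello, world\0".dup;
    char[] foo = whole[0 .. 5];         // D sees "hello"

    assert(foo.length == 5);
    assert(lengthSeenByC(foo) == 12);   // C reads on to the first '\0'
}
```

Without the terminator in `whole`, the same call would read past the end of the allocation, and nothing in the pointer would reveal that.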

So, your proposal fixes the case:
1. The user tries to pass a string/char[] to a C function. Fails to compile.
2. Instead of trying to understand the issue, realizes the .ptr member is the right type, and switches to that.
It does not fix or help with cases where:
* a programmer notices the type of the parameter is char * and uses foo.ptr without trying foo first. (crash)
* a programmer calls toStringz without going through the compile/fix cycle above.
* a programmer tries to pass string/char[], fails to compile, then looks up how to interface with C and finds toStringz
I think this fix really doesn't solve a very common problem.
See above, my intention was to solve all the cases listed here as I suspect the compiler can detect them all, and just 'do the right thing'.
In these cases..
1. If the programmer writes foo.ptr, the compiler detects that, calls toStringz on 'foo' (not foo.ptr) and updates foo as required (if reallocation occurs).

What if it's not foo.ptr? What if it's some random char * whose origin the compiler isn't aware of?

2. If the programmer calls toStringz, this case is the same as #1 as toStringz returns foo.ptr (I assume).

Huh? Why should it do anything with toStringz? I'm not getting this one, toStringz already has done the work your proposal wants to do.

This is not a 'new' problem introduced by the idea, it's a general problem for D/arrays/slices and the same happens with an append, right? In which case it's not a reason against the idea.
It's new to the features of the C function being called. If you look up the man page for such a hypothetical function, it might claim that it alters the data passed in through the argument, but it seems to not be the case! So there's no way for someone (who arguably is not well versed in C functions if they didn't know to use toStringz) to figure out why the code seems not to do what it says it should. Such a programmer may blame either the implementation of the C function, or blame the D compiler for not calling the function properly.
None of this is relevant, let me explain..
My idea is for the compiler to detect a char* parameter to an extern "C" function and to call toStringz. When it does so it will correctly update the slice/array being passed if reallocation occurs. The C function will write to the slice/array being passed. So, it's not relevant if there was another slice referencing the array before it was reallocated, because that case is no different to calling a D function which does something similar, like appending to the passed slice/array.


What about this case?

char[12] buffer;
buffer[] = "hello, world";

ucase(buffer[]); // does nothing to buffer!

I'm saying, the charter of the function is to update a string in place, and your proposal is making that not true in some cases.
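
A runnable sketch of this case, under stated assumptions: `ucase` is a hypothetical stand-in for the C function, written in D here so the example is self-contained, and the implicit conversion is modelled as appending a '\0' to the slice. Because the 12-byte buffer is full and lives on the stack, the append has to reallocate, and the function ends up uppercasing the copy.

```d
import core.stdc.ctype : toupper;

/// Hypothetical stand-in for the C function discussed in the thread:
/// uppercases a zero-terminated string in place.
extern (C) void ucase(char* s)
{
    for (; *s != 0; ++s)
        *s = cast(char) toupper(*s);
}

void main()
{
    char[12] buffer = "hello, world";   // full: no room for a terminator

    // Assumed model of the implicit conversion: append '\0' to the slice.
    // A slice of a fixed-size stack array cannot grow in place, so the
    // append moves the data to a new GC allocation.
    char[] foo = buffer[];
    foo ~= '\0';
    assert(foo.ptr !is buffer.ptr);

    ucase(foo.ptr);

    assert(foo[0 .. 12] == "HELLO, WORLD");   // the copy was changed
    assert(buffer[] == "hello, world");       // the original was not
}
```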

The goal is to make a call to an extern "C" function "just work" in the same way as calling Win32/C functions "just work" from C# .. which also has its own string type.

This is very different. C#'s strings are full reference types, so adding a '0' at the end affects all references to that string, reallocation or not.

toStringz does not currently check for '\0' anywhere in the existing string. It simply appends '\0' to the end of the passed string. If you want it to check for '\0', how far should it go? Doesn't this also add to the overhead (looping over all chars looking for '\0')?
Note also, that toStringz has old code that used to check for "one byte beyond" the array, but this is commented out, because it's unreliable (could cause a segfault).
So, toStringz is not as clever as I imagined. I thought it would intelligently detect cases where a \0 was already present in the slice (from 0 to $) and if not, put one at $+1 (inside pre-allocated array memory). I was assuming toStringz had access to the underlying array allocation size and would know how far it can 'look' without causing a segfault. In the case where the slice length equaled the array reserved memory area, it would re-allocate and place the \0 at $+1 (inside the newly allocated memory).


s/clever/slow/

The only "intelligent" way to check for a 0 is a linear search.

Without knowing where the data came from, there is no way to look past the slice without possibly causing a segfault. If you know it's a heap allocation, you can look at the block information to see if you can look past it. This might be possible to do for toStringz, but the linear check for 0 is just unacceptable for a simple function call. Appending a 0 is at least amortized. One thing though, it could make some smarter decisions as to whether to reallocate depending on the type of the array, since it is already doing a lookup of block info.

But I still always come back to the fact that I should be able to circumvent some auto-intelligent decision that isn't aware of things that a developer can be aware of (such as knowing an array already contains a 0). The compiler shouldn't be too intrusive here.


-Steve
