On Wed, 13 Jul 2011 17:00:39 +0100, Steven Schveighoffer
<[email protected]> wrote:
On Wed, 13 Jul 2011 10:59:27 -0400, Regan Heath <[email protected]>
wrote:
I am suggesting the compiler will perform a special operation on all
char* parameters passed to extern "C" functions.
The operation is a toStringz like operation which is (more or less) as
follows:
1. If there is a \0 character inside foo[0..$], do nothing.
This is an O(n) operation -- too much overhead. Especially if you
already know foo has a 0 in it. Note that toStringz does not have this
overhead.
On second thought, this step is unnecessary unless the array length matches
the memory block length .. it was intended to detect an existing \0 and
avoid the reallocation. But, this case is rare so this step could be
skipped for the general case, or only carried out when the lengths match
and reallocation is a possibility we want to avoid, or not if the cost is
too high even for that.
2. If the array allocated memory is > the array length, place a \0 at
foo[$]
The check to see if the array has allocated length requires a GC lock,
and O(lgn) search for the block info in the GC.
Not that it doesn't already happen in toStringz, but I just want to
point out that it's not a small cost.
This is the cost Walter mentioned earlier. Does this mean that heap
allocated arrays do not know how much memory they have allocated? I was
assuming they held that information, and that a slice to them would also
know. How else does an array append operation know whether to
reallocate? Does it have to obtain the GC lock and perform an O(lgn)
search on every append?
3. Reallocate the array memory, updating foo, place a \0 at foo[$]
4. Call the C function passing foo.ptr
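For reference, here is a rough sketch of steps 2 to 4 written as an ordinary D function rather than as compiler magic. The helper name terminate is made up for illustration (it is not in Phobos), and ucase is a stand-in for the C function, implemented here in D with C linkage so the example is self-contained. It leans on the append operation to decide between "extend in place" and "reallocate", so it pays the same block lookup cost that was mentioned above.

```d
import core.stdc.ctype : toupper;

// Hypothetical stand-in for the C function under discussion.
extern (C) void ucase(char* s)
{
    for (; *s != '\0'; ++s)
        *s = cast(char) toupper(*s);
}

// Hypothetical helper (not a Phobos function): guarantee a '\0' just past
// the end of foo and return foo.ptr, updating foo if the data had to move.
char* terminate(ref char[] foo)
{
    immutable len = foo.length;
    foo ~= '\0';          // steps 2/3: extends in place if possible, else reallocates
    foo = foo[0 .. len];  // the terminator stays in memory just past the slice
    return foo.ptr;       // step 4: the pointer handed to the C function
}

void main()
{
    char[] foo = "hello".dup;
    ucase(terminate(foo));
    assert(foo == "HELLO");
}
```

Note that this sketch skips step 1 (the linear search for an existing \0) entirely.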
So, it will handle all the following cases:
char[] foo;
.. code to populate foo ..
ucase(foo);
ucase(foo.ptr);
I read in your responses below, this is due to you making this
equivalent to ucase(foo)? This still has the same problems I listed
above.
Problems above? You mean the cost? Yes, there is a cost to pay, but it's
a cost which has to be paid (and is already paid by calling toStringz) to
avoid corrupting memory whether it's done explicitly or implicitly. And
the cost is only paid for extern "C" functions with char* parameters. In
the rare case where the string already contains \0 and the programmer can
guarantee that, we can have some way to indicate it, or in some cases
changing the function parameter to ubyte* or byte* may be the correct
solution.
What about
char * foo;
.. code to populate foo ..
ucase(foo);
Is there still anything special done by the compiler?
Assuming foo is allocated by the GC toStringz can still find the length of
the memory block (step #2 above) and reallocate if required, right? If so
we can handle this case as well (for no more cost than is incurred by
toStringz already).
ucase(toStringz(foo));
The problem cases are the buffer cases I mentioned earlier, and they
wouldn't be a problem if char was initialised to \0 as I first imagined.
The largest problem I've had with all this is there is a necessary
overhead of conversion. Not only that, but due to the way reallocation
works, there may be a move of data. I think it's better to require
explicit calls incurring such overhead vs. hiding the overhead calls
from the developer. Especially if the overhead calls are unnecessary.
But, the overhead is something we already pay calling toStringz
explicitly, and the reallocation is no different to an append operation.
Generally speaking I would normally agree that it's better to require
explicit calls incurring overhead etc, but this specific case is something
new D programmers stumble on all the time, and it makes D look less slick
than something like C#. I do realise the target audience is slightly
different for each, but if we can achieve something similar for no extra
cost (other than we already pay calling toStringz explicitly), then it's
well worth considering.
As far as I can see the only problem cases are those where we incur more
cost than toStringz when it's not required, and those cases seem rare to
me, and could be handled by an opt-out decoration/keyword or similar.
Other replies inline below..
On Tue, 12 Jul 2011 18:28:56 +0100, Steven Schveighoffer
<[email protected]> wrote:
On Tue, 12 Jul 2011 13:00:41 -0400, Regan Heath <[email protected]>
wrote:
Replace foo with foo.ptr, it makes no difference to the point I was
making.
Your fix does not help in that case, foo.ptr will be passed as a
non-null terminated string.
No, see above.
How does your proposal know that a char * is part of a heap-allocated
array? If you are assuming the only case where char * is passed will be
arr.ptr, then that doesn't cut it. What if the compiler doesn't know
where the char * came from?
See your Q and my A above ("char * foo" example).
The inherent problem of zero-terminated strings is that you don't know
how long it is until you search for a zero. If it's not properly
terminated, then you are screwed. That problem cannot be "solved", even
with compiler help -- you can get situations where there is no more
information other than the pointer.
Really? But can't we obtain the GC lock and look them up, as mentioned
above? And isn't this exactly what toStringz will do when the programmer
first of all curses because it has crashed, and then adds an explicit
toStringz call?
So, your proposal fixes the case:
1. The user tries to pass a string/char[] to a C function. Fails to
compile.
2. Instead of trying to understand the issue, realizes the .ptr member
is the right type, and switches to that.
It does not fix or help with cases where:
* a programmer notices the type of the parameter is char * and uses
foo.ptr without trying foo first. (crash)
* a programmer calls toStringz without going through the compile/fix
cycle above.
* a programmer tries to pass string/char[], fails to compile, then
looks up how to interface with C and finds toStringz
I think this fix really doesn't solve a very common problem.
See above, my intention was to solve all the cases listed here as I
suspect the compiler can detect them all, and just 'do the right thing'.
In these cases..
1. If the programmer writes foo.ptr, the compiler detects that, calls
toStringz on 'foo' (not foo.ptr) and updates foo as required (if
reallocation occurs).
What if it's not foo.ptr? What if it's some random char * whose origin
the compiler isn't aware of?
See above.
2. If the programmer calls toStringz, this case is the same as #1 as
toStringz returns foo.ptr (I assume).
Huh? Why should it do anything with toStringz? I'm not getting this
one, toStringz already has done the work your proposal wants to do.
I was assuming the compiler could not detect the case where the programmer
is explicitly calling toStringz i.e. what would be legacy code assuming
this proposal came into effect.
This is not a 'new' problem introduced by the idea, it's a general
problem for D/arrays/slices and the same happens with an append,
right? In which case it's not a reason against the idea.
It's new to the features of the C function being called. If you look
up the man page for such a hypothetical function, it might claim that
it alters the data passed in through the argument, but it seems to not
be the case! So there's no way for someone (who arguably is not well
versed in C functions if they didn't know to use toStringz) to figure
out why the code seems not to do what it says it should. Such a
programmer may blame either the implementation of the C function, or
blame the D compiler for not calling the function properly.
None of this is relevant, let me explain..
My idea is for the compiler to detect a char* parameter to an extern
"C" function and to call toStringz. When it does so it will correctly
update the slice/array being passed if reallocation occurs. The C
function will write to the slice/array being passed. So, it's not
relevant if there was another slice referencing the array before it was
reallocated, because that case is no different to calling a D function
which does something similar, like appending to the passed slice/array.
What about this case?
char[12] buffer;
buffer[] = "hello, world";
ucase(buffer[]); // does nothing to buffer!
I'm saying, the charter of the function is to update a string in place,
and your proposal is making that not true in some cases.
Sure, but how is that different to this:
char[12] buffer;
buffer[] = "hello, world";
ucase(buffer ~ "a"); // does nothing to buffer!
or in fact this:
char[12] buffer;
buffer[] = "hello, world";
ucase(cast(char*)toStringz(buffer)); // does nothing to buffer!
in both cases buffer remains unchanged.
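To make the full-buffer case concrete, here is a small sketch. It assumes ucase is a C-linkage function that upper-cases a zero-terminated string in place (implemented in D here so the example stands alone), and it builds the terminated copy by hand with ~ instead of going through toStringz, so nothing immutable is cast away.

```d
import core.stdc.ctype : toupper;

// Hypothetical stand-in for the C function in this thread.
extern (C) void ucase(char* s)
{
    for (; *s != '\0'; ++s)
        *s = cast(char) toupper(*s);
}

void main()
{
    char[12] buffer = "hello, world";  // full: no room for a terminator

    // Any terminated version of the data has to live in new memory.
    char[] copy = buffer[] ~ '\0';
    ucase(copy.ptr);

    assert(copy[0 .. 12] == "HELLO, WORLD"); // the copy was upper-cased
    assert(buffer[] == "hello, world");      // the original is untouched
}
```

Whether the copy is made explicitly like this or implicitly by the compiler, the C function only ever sees the copy.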
The goal is to make a call to an extern "C" function "just work" in the
same way as calling Win32/C functions "just work" from C# .. which also
has its own string type.
This is very different. C#'s strings are full reference types, so
adding a '0' at the end affects all references to that string,
reallocation or not.
I'm not sure how C# is doing it under the covers, I doubt they're adding a
\0 to the string. For all I know they're making a completely new copy,
especially if the Win32 function expects ASCII as I suspect C# uses
Unicode internally. The implementation of C# isn't really important here,
the goal is.
toStringz does not currently check for '\0' anywhere in the existing
string. It simply appends '\0' to the end of the passed string. If
you want it to check for '\0', how far should it go? Doesn't this
also add to the overhead (looping over all chars looking for '\0')?
Note also, that toStringz has old code that used to check for "one
byte beyond" the array, but this is commented out, because it's
unreliable (could cause a segfault).
So, toStringz is not as clever as I imagined. I thought it would
intelligently detect cases where a \0 was already present in the slice
(from 0 to $) and if not, put one at $+1 (inside pre-allocated array
memory). I was assuming toStringz had access to the underlying array
allocation size and would know how far it can 'look' without causing a
segfault. In the case where the slice length equaled the array
reserved memory area, it would re-allocate and place the \0 at $+1
(inside the newly allocated memory).
s/clever/slow/
The only "intelligent" way to check for a 0 is a linear search.
Fair enough.
Without knowing where the data came from, there is no way to look past
the slice without possibly causing a segfault. If you know it's a heap
allocation, you can look at the block information to see if you can look
past it. This might be possible to do for toStringz, but the linear
check for 0 is just unacceptable for a simple function call. Appending
a 0 is at least amortized. One thing though, it could make some smarter
decisions as to whether to reallocate depending on the type of the
array, since it is already doing a lookup of block info.
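As a small illustration of the block info point: a slice itself is only a pointer and a length, and the room available past its end has to be asked of the runtime. The sketch below uses the built-in capacity property for that; the variable names are only for illustration.

```d
void main()
{
    // A slice stores only a pointer and a length, nothing about its block.
    static assert((char[]).sizeof == 2 * size_t.sizeof);

    char[] heap = "hello".dup;  // lives in a GC-allocated block
    char[5] fixed = "hello";    // static array, not an appendable GC block
    char[] view = fixed[];

    // capacity is answered by the runtime from the block information.
    assert(heap.capacity >= heap.length); // room known for the GC block
    assert(view.capacity == 0);           // appending here must reallocate
}
```

A capacity of 0 means an append (or a terminator written past the end) cannot be done in place, which is the case where a reallocation is unavoidable.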
Ok, scrap the linear search, or only perform it when a reallocation may be
required.
But I still always come back to the fact that I should be able to
circumvent some auto-intelligent decision that isn't aware of things
that a developer can be aware of (such as knowing an array already
contains a 0). The compiler shouldn't be too intrusive here.
Sure, we want to keep everyone happy, the Q is, to my mind, which is the
more general case.
It would be nice to have your cake and eat it too, or in other words for
the general case (as I see it):
char[] foo;
.. code which populates foo ..
ucase(foo);
to "just work" as a new D programmer might expect, at the same time I
agree that cases where speed is of the essence, or the data is guaranteed
to contain \0 we need to be able to avoid the cost. As with most things it
comes down to cost/benefit and I think D would benefit from this default
behaviour, provided there is a way to avoid it as well.
Perhaps restricting the idea to cases like the one above where the
compiler has the information for the slice/array, and doing nothing for
raw char* cases is a good compromise, it would allow people to avoid the
behaviour just by adding .ptr or similar.