The great slice debate -- should slices be separated from arrays?

Steven Schveighoffer Tue, 24 Nov 2009 06:30:15 -0800

In many other posts, people have been fretting over dropping T[new] and not having a reference array type. The argument is (and forgive me if this is a strawman; someone from the other side can post if I'm incorrect): if we make arrays a separate type from slices, and only allow appending on arrays, then we solve the stomping problem and the hard-to-determine reallocating problem. For those who are unfamiliar with these problems, I'll try to outline them at the bottom of the post.

I contend that even if we make arrays a separate type, even if arrays are made a true reference type, slices will still suffer from the same hard-to-determine reallocation problem *unless* we make slices fatter.

My proof is as simple as an example. Assume 'Array' is the new type for an array:


auto a = new Array!(int)(15); // allocate an array of 15 integers, all 0
auto s = a[0..5];
a ~= [1,2,3,4,5];
a[0] = 1;

Now, what does s[0] equal?

It depends on how s is defined. If s is a simple pointer + length slice as defined today, then it is *STILL* hard-to-determine because you are unsure whether a had to reallocate to a new block! If you make s a "fat" slice, that is, it contains a reference to the *original* array, and 2 indexes, then you have issues:

1. you are now passing around 12 (24 for 64-bit) bytes for a slice instead of 8 (16), which hurts performance.
2. you are now subject to a slice becoming invalid if a decides to reallocate into a smaller block and your slice indexes data that is now outside the block.
3. you are now subject to corruption if the array decides to truncate from the *front* of its block, since the slice will now refer to different data.

Because of points 2 and 3, you lose static provability of slices being valid, since their backing array can reallocate at any time.
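
To make the size and validity points concrete, here is a minimal sketch of the two slice layouts. The names Array, ThinSlice, FatSlice, valid and fatSliceSurvivesShrink are hypothetical, made up for illustration; this is not a proposed implementation, and shrinking is modeled simply by re-slicing the owner's storage.

```d
import std.stdio;

// Hypothetical reference-type array, standing in for the proposed 'Array' type.
class Array(T)
{
    T[] data;
    this(size_t n) { data = new T[n]; }
}

// Thin slice: pointer + length, as slices are defined today.
struct ThinSlice(T)
{
    T* ptr;
    size_t length;
}

// Fat slice: a reference to the owning array plus two indexes.
struct FatSlice(T)
{
    Array!T owner;
    size_t lo, hi;

    // Has to be checked at run time, because the owner can shrink at any moment.
    bool valid() const { return lo <= hi && hi <= owner.data.length; }
}

// Two machine words versus three: 8 vs 12 bytes on 32-bit, 16 vs 24 on 64-bit.
static assert(ThinSlice!int.sizeof == 2 * size_t.sizeof);
static assert(FatSlice!int.sizeof == 3 * size_t.sizeof);

// Point 2 above: the slice indexes data that is no longer inside the array.
bool fatSliceSurvivesShrink()
{
    auto a = new Array!int(15);
    auto s = FatSlice!int(a, 10, 15);
    a.data = a.data[0 .. 8]; // the owner shrinks behind the slice's back
    return s.valid();
}

void main()
{
    writeln("fat slice still valid after shrink: ", fatSliceSurvivesShrink());
}
```

Nothing in the type of s records that the owner changed, so the check can only happen at run time, which is the loss of static provability described above.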

Are there any alternative implementations for slices that work better? If not, I think thin slices (pointer and length) are the way to go, which leaves us with a still hard-to-determine reallocation strategy. In this case, why not leave the arrays the way they are?

The answer is stomping. But we have already found ways to fix that without changing any code. In the instance where you need the fastest appending possible, we can create a library type to handle that, then you convert to an actual array when you need it. I think dsimcha is already working on such a type.

So is there anyone in the "please separate arrays from slices" camp that can counter this? Are there other solutions someone can think of or has already brought up that I didn't mention?


======================

Definition of the "hard-to-determine" reallocating problem: (I'd classify this as non-deterministic reallocating, but it depends on your point of view, and Walter has fits over that term :)

If you reallocate an array, it may or may not copy the data to another memory block, depending on whether it can resize into the existing memory block. If it can, then the original memory is still referenced by the array. The determination of when this occurs is dependent on the implementation of the runtime. In the current incarnation of dmd, a copy occurs when the array is sized beyond powers of 2 up to a page size, and then at page size increments.
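
One way to observe this is to watch the array's .ptr across appends. The following is only a sketch: countMoves is a hypothetical helper, and the count it reports depends entirely on the runtime and allocator in use, so no particular number should be expected.

```d
import std.stdio;

// Counts how many of n appends moved the array to a different address.
size_t countMoves(size_t n)
{
    int[] a;
    size_t moves = 0;
    foreach (i; 0 .. n)
    {
        auto before = a.ptr;
        a ~= cast(int) i;
        if (a.ptr !is before) // the data was copied to another block
            ++moves;
    }
    return moves;
}

void main()
{
    writeln("appends that moved the data: ", countMoves(10_000));
}
```

The point is that nothing in the source code says which appends move the data; only the runtime's block sizes decide that.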


The trouble is, when you see code like this:

void foo(char[] buf)
{
   buf ~= "abcd";
   buf[0] = 'z';
}

You cannot tell whether the input array was affected because the append statement may or may not change the address of buf. If it doesn't, then the argument passed in is affected. If it does, the argument passed in is not affected. The current solutions to this problem are to either document the behavior or dup the incoming buffer before appending (or use the buf = buf ~ "abcd" form, which guarantees reallocation).
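
A small sketch of both behaviors, using hypothetical helper names (fooMoved, fooCopy, callerMatchesMove, copyLeavesCallerAlone). The first function is foo with one addition: it reports whether the append moved the data. The second uses the buf = buf ~ "abcd" form.

```d
import std.stdio;

// foo from above, but it also reports whether the append moved buf.
bool fooMoved(char[] buf)
{
    auto before = buf.ptr;
    buf ~= "abcd";
    buf[0] = 'z';
    return buf.ptr !is before;
}

// Defensive form: binary ~ always builds a new array, so the caller's
// data is never written to.
void fooCopy(char[] buf)
{
    buf = buf ~ "abcd";
    buf[0] = 'z';
}

// The caller sees the 'z' exactly when the append did NOT move the data.
bool callerMatchesMove()
{
    char[] a = "hello".dup;
    immutable moved = fooMoved(a);
    return (a[0] == 'z') == !moved;
}

bool copyLeavesCallerAlone()
{
    char[] a = "hello".dup;
    fooCopy(a);
    return a == "hello";
}

void main()
{
    writeln("caller's view matches the move flag: ", callerMatchesMove());
    writeln("copy form leaves the caller untouched: ", copyLeavesCallerAlone());
}
```

Whether fooMoved returns true for a given input is exactly the thing that cannot be read off the code; fooCopy trades an allocation for a predictable answer.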


======================

Definition of the stomping problem:

If you append to a slice which *starts* at the beginning of a heap-allocated memory block, and the append operation fits within the memory block, the compiler reallocates in place. However, if that slice was a smaller piece of a larger array, it will overwrite the data in the array. The optimization came into effect when people noticed code like the following was very slow and ate up too much memory:


int[] x;

for(int i = 0; i < 10000; i++)
   x ~= i;

If x reallocates on every loop, it will allocate 10000 times, each time allocating a larger memory block. This forces more GC collection cycles and can leave behind scores of unused pages of memory if the collector doesn't reuse or allow the OS to reclaim them.

The optimization depends on the likelihood of a slice that points to the beginning of a block being the owner of the memory block (that is, all other slices were created from it, and therefore do not go beyond the extents of the primary slice). This is certainly the case after the first reallocation that occurs in the loop above, making this loop perfectly safe. However, it is not hard to come up with an unsafe case:


int[] x = [1,2,3];
auto y = x[0..1]; // slice only the first element
y ~= 5; // now x == [1,5,3]

This is even more disturbing for const or immutable data:

string x = "hello".idup; // need the idup to put the data on the heap.
auto y = x[0..1]; // slice only "h";

y ~= "owdy"; // now x, which is supposed to be immutable, changed to "howdy"
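
The rule behind both examples can be modeled in a few lines. This is a simulation under stated assumptions, not the runtime's actual code: Block, appendOldRule and stompDemo are hypothetical names, and the block is modeled as a plain int[] with spare room at the end.

```d
import std.stdio;

// Hypothetical model of one heap block, including its unused capacity.
struct Block
{
    int[] storage;
}

// Models the rule described above: if the slice starts at the beginning of
// the block and the new element fits, write in place; otherwise copy.
int[] appendOldRule(ref Block blk, int[] s, int value)
{
    immutable startsAtBlock = s.ptr is blk.storage.ptr;
    if (startsAtBlock && s.length < blk.storage.length)
    {
        blk.storage[s.length] = value; // in place: may stomp on other slices
        return blk.storage[0 .. s.length + 1];
    }
    return s ~ value; // a fresh copy, nothing is overwritten
}

int[] stompDemo()
{
    auto blk = Block(new int[4]);
    blk.storage[0 .. 3] = [1, 2, 3];
    int[] x = blk.storage[0 .. 3];
    int[] y = x[0 .. 1];          // slice only the first element
    y = appendOldRule(blk, y, 5); // y starts at the block, so it appends in place
    return x;                     // x[1] has been overwritten
}

void main()
{
    writeln(stompDemo()); // prints [1, 5, 3]
}
```

The rule never asks whether y is the largest slice into the block, which is why x is overwritten.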


This stomping becomes more of a problem for library writers:

immutable(char)* toStringZ(string s)
{
   s ~= '\0';
   return s.ptr;
}

This seemingly innocuous function could easily corrupt immutable data if s is a slice of the beginning of a larger array. Therefore, toStringZ has to be defensive:


immutable(char)* toStringZ(string s)
{
   s = s ~ '\0'; // not using ~= operator forces a reallocation
   return s.ptr;
}

But this "defensive" style would be unnecessary if stomping could not occur.
