On Thu, 24 Dec 2009 13:39:21 -0500, grauzone <[email protected]> wrote:


> Sorry about that, I didn't have a close look at the patch. I guess I was thinking more of Andrei's original proposal (and how I imagined it might be implemented).

No problem.


> It seems you store the length field inside the array's memory block (instead of in the cache, which just speeds up gc_query). That's awesome, but I'm going to complain again: now you have to keep a length field for *all* memory allocations, not just arrays! For most object allocations, that means 4 bytes of additional overhead.

Interestingly enough, the storage overhead is zero except for memory blocks larger than 256 bytes. I'll explain:

A probably little-known piece of trivia: D allocates 1 extra byte for every array allocation. Why would it do this, you ask? Because of the GC.

Imagine that it did not. What happens when you do something like this?

ubyte[] array = new ubyte[16]; // allocates a 16-byte block for the array
array = array[$..$];           // empty slice just past the end

If you look at array's ptr member, it no longer points into the allocated block, but at the start of the *next* block. The GC no longer needs to keep the original block alive, but the stale pointer now pins the next block and keeps it from being collected. In addition, if you tried appending to array, it might start writing into unallocated memory!
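To make the problem concrete, here is a minimal sketch (in Python, with hypothetical addresses and bin sizes) of why the padding byte matters:

```python
# Illustrative sketch only: why D pads array allocations by one byte.
# Without padding, an empty tail slice points at the *next* GC block.

BLOCK_BASE = 0x1000   # hypothetical start of a 16-byte GC block
BLOCK_SIZE = 16

# ubyte[] array = new ubyte[16];  array = array[$..$];
# The resulting ptr is base + length:
tail_ptr_unpadded = BLOCK_BASE + 16
# ... which is exactly the start of the next block -- the GC now pins
# a block the program never allocated for this array.
assert tail_ptr_unpadded == BLOCK_BASE + BLOCK_SIZE

# With one byte of padding, a 16-byte array request needs 17 bytes, so it
# lands in the next-larger bin (32 bytes, assuming power-of-2 bins).  Now
# base + 16 still lies strictly inside the allocated block.
padded_block_size = 32
tail_ptr_padded = BLOCK_BASE + 16
assert BLOCK_BASE <= tail_ptr_padded < BLOCK_BASE + padded_block_size
```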

So that last byte was already in use, and I commandeered it for length storage. For blocks up to and including 256 bytes, I use the last byte of the block as the length field. For blocks of 512 bytes up to a half page (2048 bytes), I use the last 2 bytes, so there is one byte of overhead beyond the current implementation.
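As a rough sketch (Python; the exact offsets are my reading of the scheme described here, not the actual druntime code), the length-field placement for binned blocks looks like this:

```python
def length_field_info(block_size):
    """Sketch of where the array length lives in a binned GC block.

    Returns (offset_of_length_field, field_size_in_bytes, usable_capacity).
    Blocks up to 256 bytes use the last byte (a 1-byte field can hold any
    length that fits in such a block); 512..2048-byte blocks use the last
    2 bytes.  Illustrative only -- not the real druntime implementation.
    """
    assert block_size <= 2048, "page-sized blocks follow different rules"
    field = 1 if block_size <= 256 else 2
    return block_size - field, field, block_size - field

# A 16-byte bin: the length lives in the final byte, 15 bytes usable.
# The padding byte that was already "wasted" now doubles as length storage,
# so the net storage overhead is zero.
assert length_field_info(16) == (15, 1, 15)
# A 512-byte bin: a 2-byte length field, i.e. one byte more than before.
assert length_field_info(512) == (510, 2, 510)
```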

Blocks larger than that follow different rules: they are not stored in bins, but allocated a whole page at a time. Such a block can be extended by appending more pages if they are free, so the length cannot live at the end of the block, because the end may move. Instead I store it at the beginning, and I use up a full 8 bytes. The reason is alignment: I don't know what you're putting in the array, so I must keep the data 8-byte aligned.
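The layout for these page-granularity blocks can be sketched like so (Python; the 4096-byte page size and helper name are my assumptions for illustration):

```python
PAGE = 4096    # assumed page size (half page = 2048, per the bin limit above)
PREFIX = 8     # 8-byte length prefix keeps the user data 8-byte aligned

def large_block_layout(block_base, n_pages):
    """Sketch (hypothetical helper): for page-granularity blocks the length
    is stored up front, because the block may later grow by appending free
    pages -- the end of the block can move, but the beginning never does.
    Returns (length_field_addr, data_addr, usable_bytes)."""
    data = block_base + PREFIX
    assert data % 8 == 0, "user data must stay 8-byte aligned"
    return block_base, data, n_pages * PAGE - PREFIX

# A one-page block: length at the base, data starts 8 bytes in.
base = 0x10000
assert large_block_layout(base, 1) == (base, base + 8, 4096 - 8)
```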

For classes, I allocate the required extra data as if the class were an array of one element of the class's data size, and then set the "ghost" length at the end of the block to 0. If the class data exceeds a half page, the ghost length sits at the beginning, where the vtable pointer is, so it's extremely unlikely to accidentally match a valid length. Note that this makes little difference in most cases: the block used for the class is a power of 2 anyway, so there is usually plenty of wasted space at the end.

I found out during testing that allocating a new struct is equivalent to allocating a new array of that struct type with length 1 and returning its pointer, so that aspect is already covered.

> Also, if you use GC.malloc directly and the user tries to append to slices of it, your code may break. GC.malloc doesn't seem to pad the memory block with a length field.

Yes, this is a possible problem. However, using GC.malloc directly and then treating the result as a normal array is probably extremely rare; at the least, it is not valid in SafeD. It probably should be documented as a danger of GC.malloc, but I don't think it will adversely affect much code. There will also be functions that obviate the need to call GC.malloc to allocate an array (specifically, for allocating uninitialized data).

> I must say that I find your argument strange: didn't you say adding an additional field to the slice struct is too much overhead?

That's overhead when passing slices around, not storage overhead: for example, pushing 12 bytes onto the stack instead of 8 when calling a function that takes a slice. If you want safe appends, you have to store the allocated length somewhere; there's no way around that.

> Also, a solution which keeps a pointer to the array length in the slice struct would still be faster. The cache lookup cost is not zero.

This is the trade-off I chose between performance when passing slices around and performance when appending. If you focus all your performance concerns on appending, other areas suffer. I don't know whether what I chose is the best solution, but I can't think of any way to have *both* speedy slice usage and near-zero append overhead. If someone can think of a better scheme, I'll be happy to incorporate it, but after using and understanding slices while developing for Tango (which uses slices to get every ounce of performance!), I'm convinced that as-fast-as-possible slice semantics for passing data around is essential.

This is where a custom type that focuses performance on appending, at the expense of other operations, is good to have. I think applications that need the absolute best append performance and nothing else are pretty rare.

> Anyway, now that the semantics and performance are somewhat sane, none of these remaining issues are too important. Thanks Steve and Andrei!

Cool!

-Steve
