[Issue 17881] Provide mechanism to preallocate memory from the GC
https://issues.dlang.org/show_bug.cgi?id=17881

Steven Schveighoffer changed:
  See Also: https://issues.dlang.org/show_bug.cgi?id=2504

--
Iain Buclaw changed:
  Priority: P1 -> P4

--
Stanislav Blinov changed:
  CC: stanislav.bli...@gmail.com

--- Comment #13 from Stanislav Blinov ---
Created attachment 1713 (https://issues.dlang.org/attachment.cgi?id=1713): example of paged allocation with GC

Jumping in a year later, almost to the hour, I'll start with a disclaimer: I agree with Andrei that such functionality is, in general, best left to a custom allocator API. That said, a straightforward utility for chunked GC allocations in Phobos or DRuntime would benefit users who don't necessarily want to import or deal with custom allocators for "just that one allocation" (scripts, prototyping, etc.).

Attached is an implementation of the sketch suggested in previous comments; see the comments in the code about certain implementation/DRuntime quirks. On my machine this implementation is about 10x faster than piecewise allocation with both dmd and ldc. It's still slower, of course, than a wholesale preallocation, but it should at least be on the same order of magnitude, unless you go for astronomical numbers of instances.

--
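For readers who just want the shape of the chunked approach, it can be sketched roughly like this (a minimal illustration, not the attached implementation; the type name and chunk size are assumptions of this sketch):

```d
import core.memory : GC;

// Carve T instances out of larger GC chunks, amortizing the
// per-allocation GC overhead. Each chunk is an independent GC
// block, so a chunk whose elements are all dead can be collected,
// unlike one giant preallocated array.
struct ChunkedGCAllocator(T, size_t perChunk = 1024)
{
    private T[] slots; // unused slots remaining in the current chunk

    T* allocate()
    {
        if (slots.length == 0)
            slots = (cast(T*) GC.malloc(perChunk * T.sizeof))[0 .. perChunk];
        auto p = &slots[0];
        slots = slots[1 .. $];
        *p = T.init;
        return p;
    }
}
```

One caveat, in line with the discussion above: a chunk stays alive while any element in it is reachable, so this trades per-element collectability for allocation speed.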
--- Comment #12 from Rainer Schuetze ---
The current GC already does something in that regard, as it pre-assigns a new memory page to a free-list of elements of the size of the last request.

The GC locking overhead can be reduced by helping the inliner a bit with some assembly. I have tried that for Win64 recently, and the benchmarks showed an improvement of about 5%.

--
--- Comment #11 from Steven Schveighoffer ---
(In reply to Andrei Alexandrescu from comment #10)
> This issue should be closed. A good allocator should be able to handle
> repeated allocations of the same size without this kind of hint.

I'm all for a solution that doesn't need hints. As far as I can tell, that solution hasn't been disclosed yet. If you know how to write a "good allocator", I really think you should do it; it would do wonders for D's GC performance. Certainly, feel free to take this ER down in the event that becomes available.

--
--- Comment #10 from Andrei Alexandrescu ---
(In reply to Steven Schveighoffer from comment #8)
> What I am looking for, though, is the high-level API of "I have 20 million
> T's I want to allocate, give me them properly allocated and typed one at a
> time".

This issue should be closed. A good allocator should be able to handle repeated allocations of the same size without this kind of hint.

--
--- Comment #9 from safety0ff.bugz ---
(In reply to Steven Schveighoffer from comment #8)
> Internally, I think something like this is needed. What I am looking for,
> though, is the high-level API of "I have 20 million T's I want to allocate,
> give me them properly allocated and typed one at a time". I don't
> necessarily even think you need the TypeInfo for the low level API. Perhaps
> the bit attributes can even be done as you allocate each block.

Here's a rough sketch of one idea:

// prev is used for temporarily pinning/unpinning a memory range
void[] gc_nmalloc(size_t num, size_t size, void[] prev = [],
                  uint ba = 0, const TypeInfo ti = null);

struct GC // in core.memory
{
    // convenience function: iterate over num freshly allocated T's
    auto NMalloc(T)(size_t num)
    {
        int loop(int delegate(T*) dg)
        {
            void[] prev;
            size_t remain = num;
            scope(exit) gc_nmalloc(0, T.sizeof, prev); // unpin last chunk
            int result;
            while (remain)
            {
                prev = gc_nmalloc(remain, T.sizeof, prev, ...);
                auto chunk = (cast(T*) prev.ptr)[0 .. prev.length / T.sizeof];
                foreach (ref e; chunk)
                    if ((result = dg(&e)) != 0)
                        return result;
                remain -= chunk.length;
            }
            return result;
        }
        return &loop;
    }
}

So user code could look like:

foreach (e; GC.NMalloc!T(20_000_000))
    // do something with the allocated e

--
--- Comment #8 from Steven Schveighoffer ---
(In reply to safety0ff.bugz from comment #7)
> I think the API should obviously have the size of the elements as a
> separate argument and provide as much information as malloc, e.g.:
> GC.preload(size_t num, size_t size, uint ba = 0, const TypeInfo ti = null)

Internally, I think something like this is needed. What I am looking for, though, is the high-level API of "I have 20 million T's I want to allocate, give me them properly allocated and typed one at a time". I don't necessarily even think you need the TypeInfo for the low-level API. Perhaps the bit attributes can even be done as you allocate each block.

--
safety0ff.bugz changed:
  CC: safety0ff.b...@gmail.com

--- Comment #7 from safety0ff.bugz ---
(In reply to Steven Schveighoffer from comment #6)
> > If we add thread local free-lists to the GC, the overhead of allocating
> > these from the GC instead of caching them in the AA would be rather small.
>
> Agreed, I think keeping the lists inside the GC is the most useful, and does
> not expose any implementation details to the user.

I think this is the only sane way of doing it, i.e. without over-constraining the GC implementation or exposing the user to implementation details.

I think the API should obviously have the size of the elements as a separate argument and provide as much information as malloc, e.g.:

GC.preload(size_t num, size_t size, uint ba = 0, const TypeInfo ti = null)

Otherwise it relies on hope that the GC will pull from the same pool.

--
Alexandru Razvan Caciulescu changed:
  CC: alexandru.razva...@gmail.com
  Assignee: nob...@puremagic.com -> alexandru.razva...@gmail.com

--
--- Comment #6 from Steven Schveighoffer ---
(In reply to Rainer Schuetze from comment #5)
> A slightly simpler API could be to add allocating N same-sized chunks from
> the GC that returns them in a free-list-like chain.

I think an array is best, since the data already is an array, and a free-list requires building a linked list beforehand.

> I agree with Andrei that we should not complicate the GC interface for every
> possible allocation pattern a user might want to optimize for, though.

Hm... I'm trying to find the most generic interface for this. It's not necessarily limited to AAs, as allocating a bunch of blocks in a loop isn't an uncommon thing. If there were a way to "re-flavor" the blocks from one giant block into individual ones, then we could do this outside the GC.

> If you call GC.reserve(20_000_000*(Key.sizeof+Value.sizeof)) before
> inserting elements, there should be no collection while filling the AA.

Limiting the collection in the example I posted shaves off about 300-400 msec, but it is still 1.5 seconds vs. 0.1 for the local-array version.

> If we add thread local free-lists to the GC, the overhead of allocating
> these from the GC instead of caching them in the AA would be rather small.

Agreed, I think keeping the lists inside the GC is the most useful, and does not expose any implementation details to the user.

--
Rainer Schuetze changed:
  CC: r.sagita...@gmx.de

--- Comment #5 from Rainer Schuetze ---
A slightly simpler API could be to add a way of allocating N same-sized chunks from the GC that returns them in a free-list-like chain. I agree with Andrei that we should not complicate the GC interface for every possible allocation pattern a user might want to optimize for, though.

If you call GC.reserve(20_000_000*(Key.sizeof+Value.sizeof)) before inserting elements, there should be no collection while filling the AA.

If we add thread-local free-lists to the GC, the overhead of allocating these from the GC instead of caching them in the AA would be rather small.

--
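The GC.reserve workaround suggested here can be shown concretely (a sketch; Key and Value are placeholder types standing in for the AA's actual key/value types):

```d
import core.memory : GC;

alias Key = int;    // placeholder for illustration
alias Value = long; // placeholder for illustration

void fillAA()
{
    // Grow the GC heap up front so filling the AA does not trigger
    // collections mid-loop; GC.reserve returns the number of bytes
    // actually reserved.
    GC.reserve(20_000_000 * (Key.sizeof + Value.sizeof));

    Value[Key] aa;
    foreach (i; 0 .. 20_000_000)
        aa[cast(Key) i] = i;
}
```

Note that this only avoids collections during the loop; each AA entry is still allocated individually, which is the remaining cost measured in comment #6.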
--- Comment #4 from Steven Schveighoffer ---
Stevens-MacBook-Pro:testd steves$ cat testpreallocate.d

struct S
{
    S* next;
}

void main()
{
    version(freelist)
    {
        S* head = null;
        foreach(i; 0 .. 20_000_000)
            head = new S(head);
    }
    version(preallocate)
    {
        S* head = null;
        auto x = new S[20_000_000];
        foreach(i; 0 .. 20_000_000)
        {
            x[i] = S(head);
            head = &(x[i]);
        }
    }
}

Stevens-MacBook-Pro:testd steves$ dmd -O -release testpreallocate.d -version=freelist
Stevens-MacBook-Pro:testd steves$ time ./testpreallocate

real    0m1.869s
user    0m1.750s
sys     0m0.114s

Stevens-MacBook-Pro:testd steves$ dmd -O -release testpreallocate.d -version=preallocate
Stevens-MacBook-Pro:testd steves$ time ./testpreallocate

real    0m0.111s
user    0m0.045s
sys     0m0.062s

The point is that the GC is not well-equipped to handle the tight allocation loop. The second version has the drawback that all 20 million elements will remain in memory as long as there is one element still alive. What I'm looking for is something that has the performance (or similar, I realize it won't be as good) of the second, but can collect each block individually like the first.

--
--- Comment #3 from Andrei Alexandrescu ---
> The use case is different than for GC.allocate.

The semi-joke was that this is all that's needed.

> I want to fill in a structure of nodes, I know I'm going to fill it in with
> 10,000 elements, so I'm going to allocate them all in a loop. But I don't want
> the GC to keep the ENTIRE thing in memory if just one element is still pointed
> at. And I don't want to run the GC constantly in my tight loop allocating each
> node.

The solution is to prefill a freelist, then let go of the extra elements remaining.

> Think of a tree constructor, or an AA constructor.
> Essentially it's the same as array.reserve, but for individual blocks instead
> of a contiguous single block.

That's not the job of the allocator; it's the job of the user of the allocator. The allocator gives you memory. You organize it as you see fit.

--
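The freelist prefill being suggested might look roughly like this (a sketch under the assumption of an intrusive next pointer; this is not code from the thread, and the names are invented for illustration):

```d
struct Node
{
    Node* next;
    int value;
}

// Prefill a freelist with n nodes up front, then pop nodes from it
// while building the real structure. Each node is its own GC block,
// so nodes that are later unlinked remain individually collectable.
Node* prefill(size_t n)
{
    Node* head = null;
    foreach (i; 0 .. n)
        head = new Node(head);
    return head;
}

Node* pop(ref Node* freelist)
{
    auto n = freelist;
    freelist = n.next;
    n.next = null; // detach so the rest of the list can be let go
    return n;
}
```

As comment #4 below notes, the prefill loop itself is still N individual GC allocations, which is exactly the cost the enhancement request wants to reduce.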
--- Comment #2 from Steven Schveighoffer ---
The use case is different than for GC.allocate. I want to fill in a structure of nodes; I know I'm going to fill it in with 10,000 elements, so I'm going to allocate them all in a loop. But I don't want the GC to keep the ENTIRE thing in memory if just one element is still pointed at. And I don't want to run the GC constantly in my tight loop allocating each node. Think of a tree constructor, or an AA constructor.

Essentially it's the same as array.reserve, but for individual blocks instead of a contiguous single block.

--
Andrei Alexandrescu changed:
  CC: and...@erdani.com

--- Comment #1 from Andrei Alexandrescu ---
This seems like a tall order. It's essentially pushing a specific management need onto the GC, complicating everyone's life in the process. If an application needs to preallocate, there's an API for that already - it's called GC.allocate() :o). Then the application needs to do its own bookkeeping.

--