https://github.com/python/cpython/commit/31a500a92b1baf49a21d15b6b5fc0b6095f1e305
commit: 31a500a92b1baf49a21d15b6b5fc0b6095f1e305
branch: main
author: Neil Schemenauer <nas-git...@arctrix.com>
committer: nascheme <nas-git...@arctrix.com>
date: 2025-04-28T13:32:39-07:00
summary:
Add internal docs about the free-threaded GC. (gh-132562)

files:
M InternalDocs/garbage_collector.md

diff --git a/InternalDocs/garbage_collector.md b/InternalDocs/garbage_collector.md
index 6d8fe6323d3152..3b2551a6b95b14 100644
--- a/InternalDocs/garbage_collector.md
+++ b/InternalDocs/garbage_collector.md
@@ -477,17 +477,24 @@ specifically in a generation by calling `gc.collect(generation=NUM)`.
 ```

-Optimization: visiting reachable objects
-========================================
+Optimization: excluding reachable objects
+=========================================

-An object cannot be garbage if it can be reached.
+An object cannot be garbage if it can be reached. To avoid having to identify
+reference cycles across the whole heap, we can reduce the amount of work done
+considerably by first identifying objects reachable from objects known to be
+alive. These objects are excluded from the normal cyclic detection process.

-To avoid having to identify reference cycles across the whole heap, we can
-reduce the amount of work done considerably by first moving most reachable objects
-to the `visited` space. Empirically, most reachable objects can be reached from a
-small set of global objects and local variables.
-This step does much less work per object, so reduces the time spent
-performing garbage collection by at least half.
+The default and free-threaded builds both implement this optimization, but in
+slightly different ways.
+
+Finding reachable objects for the default build GC
+--------------------------------------------------
+
+This works by first moving most reachable objects to the `visited` space.
+Empirically, most reachable objects can be reached from a small set of global
+objects and local variables. This step does much less work per object, so
+reduces the time spent performing garbage collection by at least half.

 > [!NOTE]
 > Objects that are not determined to be reachable by this pass are not
 > necessarily
@@ -515,6 +522,171 @@ increment.
 All objects directly referred to from those stack frames are added
 to the working set. Then the above algorithm is repeated, starting from step 2.

+
+Finding reachable objects for the free-threaded GC
+--------------------------------------------------
+
+Within the `gc_free_threading.c` implementation, this is known as the "mark
+alive" pass or phase. It is similar in concept to what is done for the
+default build GC. Rather than moving objects between doubly linked lists,
+the free-threaded GC uses a flag in `ob_gc_bits` to track whether an object
+has been found to be definitely alive (not garbage).
+
+To find objects reachable from known alive objects, known as the "roots",
+the `gc_mark_alive_from_roots()` function is used. Root objects include
+`interp->sysdict` (the `sys` module dictionary), `interp->builtins`, and
+`interp->types`. Also included are all objects referred to by active Python
+frames. These objects and the transitive closure of objects reachable from
+them have `_PyGC_BITS_ALIVE` set. Any object with that bit set is excluded
+from the rest of the cyclic garbage detection process, since we know it
+cannot be unreachable.
+
+> [!NOTE]
+> If the `gc.freeze()` function has been used, this phase of the collector
+> is skipped. That is done for two reasons. First, it is unlikely to be a
+> performance win if most of the objects have been marked as frozen. As
+> such, they would be excluded from the cyclic garbage detection without
+> this extra work anyhow. Second, one of the purposes of using
+> `gc.freeze()` is to avoid dirtying the memory pages holding frozen
+> objects. If this phase were executed, the toggling of the `ob_gc_bits`
+> flags would dirty pages and defeat that.
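+
+To make the mechanism concrete, the sketch below shows the general shape of
+the mark alive pass. It is an illustrative model only: the `GCObject`
+struct, its fields, and the recursive traversal are simplified stand-ins,
+not the actual `gc_free_threading.c` code, which walks references via
+`tp_traverse` and uses an explicit worklist rather than recursion.
+
+```c
+#include <stddef.h>
+
+#define _PyGC_BITS_ALIVE 0x08   /* illustrative bit value only */
+
+typedef struct GCObject {
+    unsigned char ob_gc_bits;   /* per-object GC flag bits */
+    size_t n_refs;
+    struct GCObject **refs;     /* objects directly reachable from this one */
+} GCObject;
+
+/* Set the "alive" bit on obj and on everything transitively reachable
+   from it.  Already-marked objects are skipped, which also terminates
+   reference cycles. */
+static void
+mark_alive_from(GCObject *obj)
+{
+    if (obj == NULL || (obj->ob_gc_bits & _PyGC_BITS_ALIVE)) {
+        return;
+    }
+    obj->ob_gc_bits |= _PyGC_BITS_ALIVE;
+    for (size_t i = 0; i < obj->n_refs; i++) {
+        mark_alive_from(obj->refs[i]);
+    }
+}
+```
+
+After this runs from every root, any object still lacking the bit is handed
+to the normal cyclic garbage detection pass.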
+
+Software prefetch hinting
+-------------------------
+
+To speed up the "mark alive" phase of the free-threaded GC, an additional
+optimization, known as software prefetching, is used. The GC executes
+explicit prefetch CPU instructions in order to reduce the latency caused by
+loading data from main memory. This added complexity can pay off since main
+memory is so much slower than the CPU cache. Prefetching is enabled only if
+the number of long-lived objects exceeds a threshold. If the set of objects
+being garbage collected is small, the accessed memory is likely to fit
+entirely in the CPU cache and software prefetch is not helpful.
+
+The details of this optimization are intricate, with the source code being
+the best reference. However, the rest of this section gives a high-level
+description of how it works and explains some design decisions.
+
+Software prefetching is used only during the "mark alive" phase of the GC,
+specifically while computing the transitive closure of the "alive" status
+of objects (that is, finding the objects reachable from the known-alive
+roots). For each object we find, we need to traverse all objects directly
+reachable from that object. If that set of referred objects is in scattered
+locations of memory, the hardware prefetcher is unlikely to predict the
+next accessed memory location.
+
+Making software prefetch work well hinges on a key principle: allow enough
+time between issuing the prefetch instruction for a memory location and
+actually accessing that location's data. We can call that time difference
+the prefetch window. If the window is too large, we fill up the CPU caches
+with data that's not needed yet. Worse, the data might be discarded from
+the cache before we actually use it. If the window is too small, the memory
+system does not have enough time to finish loading the memory and the CPU
+will have to wait. The window is indirectly tuned by the prefetch buffer
+parameters; the buffer is explained next.
+
+The prefetch buffer is a FIFO queue of fixed size. When we enqueue an
+object reference into the buffer, we also issue a software prefetch
+instruction for that memory location. When we dequeue an object reference
+from the buffer, we assume or hope that enough time has elapsed so that the
+memory has been loaded into the cache. This is the mechanism that provides
+the window.
+
+When performing the transitive closure of "alive" status, the set of
+objects yet to visit is stored in one of two places. First, they can be
+stored in the prefetch buffer. Second, there is a LIFO stack of unlimited
+size. When object references are found using `tp_traverse`, they are
+enqueued in the buffer if it is not full; otherwise, they are pushed onto
+the stack.
+
+We must take special care not to access the memory referred to by an object
+pointer until we take that reference out of the prefetch buffer. That means
+we cannot check if an object is a "container" (if the `PyObject_IS_GC()`
+test is true) or if the object already has the "alive" flag set. Both of
+those tests would require that the object memory is accessed. There are
+some additional elaborations that try to keep the buffer filled to the
+optimal size, but to keep things simple we omit those details here. Also
+not discussed in detail are "spans", which help reduce the amount of stack
+used and make it easier to control the size of the buffer.
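+
+The following sketch shows one way such a buffer can be implemented. The
+type, the sizes, and the use of `__builtin_prefetch()` (a GCC/Clang
+builtin) are illustrative assumptions, not the actual definitions from
+`gc_free_threading.c`:
+
+```c
+#include <stddef.h>
+
+#define BUFFER_SIZE 256            /* a power of two, so masking works */
+
+typedef struct {
+    void *objs[BUFFER_SIZE];       /* fixed-size FIFO ring buffer */
+    size_t head, tail;             /* monotonically increasing counters */
+} prefetch_buffer;
+
+static size_t
+buffer_len(const prefetch_buffer *b)
+{
+    return b->head - b->tail;
+}
+
+/* Enqueue an object pointer and issue the prefetch hint for it.  The
+   pointed-to memory must not be touched until the pointer comes back
+   out of the buffer. */
+static void
+buffer_push(prefetch_buffer *b, void *obj)
+{
+    __builtin_prefetch(obj, 0, 3); /* prefetch for read */
+    b->objs[b->head++ & (BUFFER_SIZE - 1)] = obj;
+}
+
+/* Dequeue the oldest pointer; by now the prefetch issued by
+   buffer_push() has hopefully pulled its memory into the cache. */
+static void *
+buffer_pop(prefetch_buffer *b)
+{
+    return b->objs[b->tail++ & (BUFFER_SIZE - 1)];
+}
+```
+
+The distance in time between a `buffer_push()` and the matching
+`buffer_pop()` is what realizes the prefetch window described above.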
+
+As mentioned, the prefetch window is the time delay between the issue of
+the prefetch instruction, on buffer enqueue, and the memory access, after
+buffer dequeue. It is tuned by adjusting the buffer parameters. If
+processing time were equal for every object, the buffer length would be
+proportional to the window. Since processing each object can actually take
+a variable amount of time, the relation between the buffer length and the
+prefetch window is only approximate. However, this proportional
+relationship is assumed to hold true, and we avoid the overhead of actually
+measuring object processing times.
+
+The relevant parameters are the maximum buffer size and the low and high
+thresholds for filling. The buffer parameters are set as follows: the
+maximum length is 256, the low threshold is 8, and the high threshold is
+16. These parameters are used as follows. If the buffer has reached the
+maximum size, new object pointers found while following references are
+pushed onto the stack rather than put in the buffer. When dequeuing objects
+from the buffer, we "prime" the buffer if the current length drops below
+the low threshold. Priming means popping objects from the stack and
+enqueuing them into the buffer. While priming, we fill it only until the
+high threshold is reached.
+
+To measure the effectiveness of the buffer, some benchmark programs were
+run while recording a trace log of memory prefetch and access operations.
+The prefetch window for each object processed was computed from the trace
+log. Each enqueue and dequeue operation was treated as taking one unit of
+time, and the time to actually process the object was assumed to be zero.
+A histogram of the windows is shown below. These traces suggest the buffer
+length is mostly being kept between the low and high thresholds, which is
+what we want. Variations of the buffer parameters were benchmarked and the
+best-performing parameters were chosen. Obviously, it is unlikely these
+parameters will be optimal for all hardware and programs.
+
+```
+Prefetch window stats
+mean 52.1
+median 14.0
+max 256
+  25.60 |65,304 | ******************************
+  51.20 |5,590  | **
+  76.80 |3,562  | *
+ 102.40 |2,683  | *
+ 128.00 |2,278  | *
+ 153.60 |2,285  | *
+ 179.20 |2,377  | *
+ 204.80 |2,238  | *
+ 230.40 |2,753  | *
+ 256.00 |5,930  | **
+-------- |------- | -------
+     N= |95,000
+```
+
+Using software prefetch instructions is only a win if the set of objects
+being examined by the GC does not fit into CPU caches. Otherwise, using the
+buffer and prefetch instructions is just overhead. The long-lived object
+count seems a good estimate of whether things will fit in the cache. On
+64-bit platforms, the minimum object size is 32 bytes, so a 4 MB L2 cache
+would hold about 130,000 objects.
+
+The current threshold for enabling prefetch is that the number of
+long-lived objects must exceed 200,000. Based on benchmarking, this seems
+to be in the range where prefetch becomes a net gain. Obviously, it depends
+on hardware details and also on the "shape" of the object graph. For
+example, your object graph may be constructed by linearly allocating
+objects in memory. Then, traversing the object graph might naturally result
+in mostly ordered memory access. In that case, the hardware prefetcher is
+likely to do a nearly perfect job without any software prefetch hints.
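+
+Putting the parameters together, the dequeue-and-prime loop can be
+sketched as below, building on the `prefetch_buffer` sketch above. Again,
+`object_stack` and `visit_object()` are hypothetical stand-ins for the
+real traversal code, not CPython definitions:
+
+```c
+#include <stdbool.h>
+
+#define BUFFER_LO 8     /* prime the buffer when it drops below this */
+#define BUFFER_HI 16    /* ...and refill it only up to this length */
+
+typedef struct {
+    void *items[4096];  /* a real implementation would grow dynamically */
+    size_t n;
+} object_stack;
+
+static bool stack_empty(const object_stack *s) { return s->n == 0; }
+static void *stack_pop(object_stack *s) { return s->items[--s->n]; }
+
+/* Traverse one object; references it holds go into the buffer if there
+   is room, otherwise onto the stack (definition not shown). */
+void visit_object(void *obj, prefetch_buffer *buf, object_stack *stack);
+
+static void
+process_worklist(prefetch_buffer *buf, object_stack *stack)
+{
+    while (buffer_len(buf) > 0 || !stack_empty(stack)) {
+        if (buffer_len(buf) < BUFFER_LO) {
+            /* Prime: pop from the stack and enqueue (prefetching as we
+               go) until the high threshold is reached. */
+            while (buffer_len(buf) < BUFFER_HI && !stack_empty(stack)) {
+                buffer_push(buf, stack_pop(stack));
+            }
+        }
+        /* Only after dequeuing is it safe to dereference the object. */
+        visit_object(buffer_pop(buf), buf, stack);
+    }
+}
+```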
+
+This optimization, as of March 2025, was tuned on the following hardware
+platforms:
+
+- Apple M3 Pro, 32 GB RAM, 192+128 KB L1, 16 MB L2, compiled with Clang 19
+- AMD Ryzen 5 7600X, 64 GB RAM, 384 KB L1, 6 MB L2, 32 MB L3, compiled with GCC 12.2.0
+
+Benchmarking the effectiveness of this optimization is particularly
+difficult. It depends both on hardware details, like CPU cache sizes and
+memory latencies, and on the specifics of the program's memory access
+patterns: where objects are located in memory and in what order they are
+accessed during the mark alive phase. When the program's memory access
+patterns are favourable (a working set of data larger than the CPU cache,
+and objects allocated in such a way that access order is not linear), the
+speedup from using software prefetching is in the range of 20% to 40% for
+the entire full GC collection.
+
+

 Optimization: reusing fields to save memory
 ===========================================