On Sun, May 17, 2026 at 0:19, Go Kudo <[email protected]> wrote:
> Hi internals,
>
> I'd like to start the discussion for a new RFC, OPcache Static Cache.
>
> RFC: https://wiki.php.net/rfc/opcache_static_cache
> Implementation: https://github.com/php/php-src/pull/22052
>
> The proposal adds an OPcache-managed shared-memory cache for explicit
> userland values and for selected PHP static state. It introduces explicit
> functions under the OPcache namespace (volatile_* and persistent_*) and two
> attributes, #[OPcache\VolatileStatic] and #[OPcache\PersistentStatic], that
> let selected static properties and method static variables survive across
> requests. The feature is disabled by default and only activates once memory
> is allocated through the new INI directives.
>
> The RFC covers the motivation, the deliberate split between the two
> backends, the trust model (one PHP runtime = one trust domain; this is not
> a tenant isolation boundary), and benchmarks against APCu on NTS php-fpm
> and ZTS FrankenPHP. The PR is the full implementation, with PHPT coverage
> summarized in the Validation section.
>
> One thing to flag on the implementation status: the Windows build is
> currently broken. I don't have a Windows development environment available
> yet; one is being arranged through work, and I'll get the Windows side
> fixed once that's in place.
>
> Feedback welcome.
>
> Best Regards,
> Go Kudo
>
Hi Nicolas, Jakub, Timo, Larry

I have updated the RFC and the implementation:

RFC: https://wiki.php.net/rfc/opcache_static_cache
PR: https://github.com/php/php-src/pull/22052

I'm folding my replies to all of you into one message, since the threads overlap. Most of it answers Nicolas's measurements; further down there is a section for Jakub's FPM pool-isolation concern and a short note for Timo's pointer to prior art.

Nicolas, thank you for building my branch and running your own A/B/C measurements. That moved the discussion onto concrete ground, and I appreciate it.

Since your review I have pushed a revised branch and bumped the RFC to 2.0.0. The API changes discussed below are in it (the SAPI opt-in model, and `getCacheStoreType()` for storage-path visibility), and the object workloads you flagged are now substantially faster: native now beats the deepclone path on every nested case I tried. Details and numbers follow.

I agree with most of your points. I'll go through them in order, concede the ones where you are right, and try to narrow what is left. I think it comes down to one question: whether a userland array-hydration layer is an acceptable replacement for engine-level object storage. Most of the rest I can give you.

## The resulting public API

For reference, here is the shape the explicit API settled into, summarised from the stub:

```php
namespace OPcache;

// Explicit cache: two final classes, static methods only, no instances.
final class VolatileCache {
    public static function get(string $key, null|bool|int|float|string|array|object $default = null): null|bool|int|float|string|array|object;
    public static function getMultiple(array $keys, ?array $default = null): array|false;
    public static function set(string $key, null|bool|int|float|string|array|object $value, int $ttl = 0): bool;
    public static function setMultiple(array $values, int $ttl = 0): bool;
    public static function has(string $key): bool;
    public static function delete(string $key_or_class): bool;
    public static function deleteMultiple(array $keys): bool;
    public static function clear(): bool;
    public static function lock(string $key, int $lease = 0): bool;
    public static function unlock(string $key): bool;
    public static function getCacheStoreType(string $key_or_property, ?string $class_name = null): CacheStoreType;
    public static function info(): StaticCacheInfo;
}

// PinnedCache is the same set, except set()/setMultiple() take no $ttl,
// plus two atomic counters:
final class PinnedCache {
    // get/getMultiple/set/setMultiple/has/delete/deleteMultiple/clear/
    // lock/unlock/getCacheStoreType/info -- as above
    public static function increment(string $key, int $step = 1): int|false;
    public static function decrement(string $key, int $step = 1): int|false;
}

// getCacheStoreType() reports how a value is stored, without decoding it:
enum CacheStoreType {
    case NotFound;          // no entry for the key/property
    case Scalar;            // stored inline
    case SharedGraph;       // zero-copy graph laid out in SHM (the fast path)
    case OPcacheSerialized; // OPcache binary serializer (SHM-safe, no userland)
    case PHPSerialized;     // php_var_serialize() last resort
}

// Declarative static state, over the same storage:
#[Attribute]
final class VolatileStatic {
    public function __construct(int $ttl = 0, CacheStrategy $strategy = CacheStrategy::Immediate);
}

#[Attribute]
final class PinnedStatic {}

enum CacheStrategy: int {
    case Immediate = 0;
    case Tracking = 1;
}

// Status object and the single exception type:
final readonly class StaticCacheInfo { /* enabled, available, configured_memory, entry_count, ... */ }
class StaticCacheException extends \Exception {}
```

Two final classes with static methods, no instances and no shared interface. Misses and contention return the default or `false`; genuine backend failures return `false` (or `int|false` for the atomic counters); `Closure` and resource values are rejected with a `TypeError`; and `StaticCacheException` is reserved for strict `#[OPcache\PinnedStatic]` publication.

## SAPI availability: the unsafe flag is gone, opt-in instead

> these are safe SAPIs, they just don't have a scoping concept built in
> [...] enable it by default with a single default scope for those SAPIs,
> plus a clear internal API so a SAPI can define its own scoped segments

I implemented it the way you suggested. There is no longer an `opcache.static_cache.allow_unsafe_runtime` directive and no SAPI-name allowlist in the engine. Availability is opt-in: a SAPI, or an embedder, calls a small internal C API, `zend_opcache_static_cache_opt_in()`, before request handling to enable Static Cache for its runtime. That call is the runtime declaring that a trust/storage boundary holds for the lifetime of the shared-memory owner.

The bundled `fpm`, `cli`, `cli-server` and `phpdbg` SAPIs call it at startup, so they are available by default. The difference from before is the mechanism: instead of the engine guessing from the SAPI name and offering an "unsafe" override, each runtime states that it owns a boundary. A runtime with a real per-tenant boundary scopes it with the partition API (`zend_opcache_static_cache_partition_create` / `_activate`, which `fpm` already uses per pool). A runtime without one, such as a shared multi-tenant web SAPI with no pre-request identity, never opts in and stays unavailable, with nothing left to misconfigure.

The `embed` SAPI does not auto-opt-in, on purpose.
The embedding application owns the runtime and its trust boundary, so it opts in from its own startup code. That keeps the rule consistent for every embedder, including one that registers its own SAPI module instead of reusing the bundled `embed` one. FrankenPHP does exactly that, so it opts in with the same one-line call (or a scoped partition when it isolates per worker); there is no `embed` special-case that covers `php_embed` users but silently misses FrankenPHP.

That is your internal-API point, and it removes the naming question by deleting the flag entirely. The full ext/opcache suite passes with the directive gone.

## API shape: remember()

> I could also add VolatileCache::remember($key, $compute, $ttl = 0)
> wrapping the safe lock -> build-outside-the-lock -> store sequence

I would rather not add this one. `remember()` takes a callable, and to actually prevent a stampede it has to hold the entry lock across the call to `$compute()`. That means running arbitrary userland PHP while holding a cross-process SHM lock. The callable can run unbounded, throw, fork, or re-enter the cache, and a re-entrant `lock()` on the same key (or a key in the same lock stripe) while the lock is held is a deadlock. The lease bounds the duration, but not the re-entrancy and not the exception path.

Not holding the lock while computing gives no stampede protection at all; it is then just sugar over `get()`-then-`set()` that looks atomic, which is worse than not having it.
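
A minimal sketch of that second failure mode, assuming a hypothetical in-process `ToyCache` stand-in (it is not the OPcache API, and the two "workers" are simulated by interleaving the steps by hand in one process). It only shows why an unlocked `get()`-then-`set()` lets the expensive rebuild run more than once:

```php
<?php
// Hypothetical stand-in for a shared cache; NOT the OPcache API.
final class ToyCache
{
    /** @var array<string, mixed> */
    private static array $data = [];

    public static function get(string $key, mixed $default = null): mixed
    {
        return array_key_exists($key, self::$data) ? self::$data[$key] : $default;
    }

    public static function set(string $key, mixed $value): void
    {
        self::$data[$key] = $value;
    }
}

$computeCalls = 0;
$compute = function () use (&$computeCalls): string {
    $computeCalls++;               // stands in for an expensive rebuild
    return 'expensive value';
};

// Both workers check before either one stores: the stampede window.
$missA = ToyCache::get('report') === null;   // worker A: miss
$missB = ToyCache::get('report') === null;   // worker B: miss

if ($missA) {
    ToyCache::set('report', $compute());     // worker A rebuilds
}
if ($missB) {
    ToyCache::set('report', $compute());     // worker B rebuilds as well
}

echo "compute ran {$computeCalls} times\n";  // prints "compute ran 2 times"
```

With real workers the duplicated work scales with the number of requests that miss in the same window, which is the case a lock is meant to prevent.
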
Since I already expose `lock()`/`unlock()` with a lease, userland can do the safe thing itself, with the compute step outside any engine lock:

```php
if (!VolatileCache::lock($key, $lease)) {
    return VolatileCache::get($key, $default);
}
try {
    $value = $compute(); // runs outside the engine lock
    VolatileCache::set($key, $value, $ttl);
    return $value;
} finally {
    VolatileCache::unlock($key);
}
```

That keeps the closure's execution, its scope, and any exception it throws in userland, never inside the engine's critical section. I would rather document this recipe than move userland execution into the primitive. If you see a safe construction I have missed, I will reconsider.

## References and the silent fallback

> I'd rather make it visible (surface the chosen path in info(), or in a
> debug build) than ban objects

Agreed, and that is implemented: visibility, not a ban. There is a new introspection method on both cache classes:

```php
VolatileCache::getCacheStoreType(string $key_or_property, ?string $class_name = null): OPcache\CacheStoreType
PinnedCache::getCacheStoreType(string $key_or_property, ?string $class_name = null): OPcache\CacheStoreType
```

It returns an `OPcache\CacheStoreType` enum (`NotFound`, `Scalar`, `SharedGraph`, `OPcacheSerialized`, `PHPSerialized`), so you can see per key which path a value took, without decoding it, in any build rather than only a debug one. Passing `$class_name` inspects the attribute-backed static-property storage for that class instead of an explicit key. A value that fell back to serialization is now one call away from being observable.

The enum also pins down a correction. The first fallback off the shared graph is not `php_var_serialize` but the OPcache binary serializer, which is SHM-safe and runs no userland code. That is why `getCacheStoreType` reports `OPcacheSerialized` and `PHPSerialized` as separate cases; `php_var_serialize` is the last resort, not the first.
So "bail == APCu parity" understates the middle tier, though your underlying point holds: even that tier is slower than the fast path and should be visible.

> no real objection to rejecting top-level hard refs up front [...]
> "top-level hard ref" confuses me

You are right to be confused, and I will retract the phrase; it is a no-op. `set($key, $value)` takes `$value` by value, so the engine dereferences any top-level reference (`ZVAL_DEREF`) before storage ever sees it. A top-level hard ref cannot reach the storage layer as a reference. The case that matters is a nested reference, a `&` inside an array element or object property, and that cannot be rejected cheaply up front: detecting it requires walking the whole graph, which is the walk the shared-graph builder already does. So the honest answer for nested refs is the visibility above (the value reports the serialize path), not an up-front rejection.

## Scalars and arrays-of-scalars only

This is where the discussion helped most. I argued before that scalars-only gave up a real win; you pushed back with measurements; so I built your setup and measured it properly, including the large nested workloads that are the actual case for a cache. You were right that native was losing. That sent me into the implementation, and I found the cause and fixed it. The path is worth setting out.

Two of your framings I agree with up front:

1. For array-of-scalars config/metadata, an immutable interned array is essentially free, and the cache should not claim to beat it.
2. The "Nx faster than APCu" headline is size-dependent; APCu is only a few microseconds for small payloads.

### (a) The config array

> an immutable array is essentially free (0.045 us) [...] the static
> cache's own array fetch, which pays an O(n) walk per read and so doesn't
> even deliver the immutable-array win that opcache literals already give

You are structurally right, and I have fixed it. Two facts first.
I could not reproduce 331 us: a pure-scalar 4k-entry array fetches in about 7 us, scaling at roughly 1.7 ns/entry, and the decode itself was already zero-copy (a scalar array is stored once as `IS_ARRAY_IMMUTABLE` and returned as `ZVAL_ARR()` straight into SHM). The O(n) you felt was one layer up: every warm fetch re-walked the array in `value_needs_request_local_clone()` to decide whether it needed a deep clone, when that answer is fixed at store time. I removed that walk for shared-graph values (the same change as in (c)); the 4k fetch is now about 0.64 us and flat in the entry count.

It is still not the 0.014 us of a resident literal read, and I am not claiming it should be. For read-only scalar config the preload/literal path wins, and that is fine. It is a separate matter from objects.

### (b) Objects: I measured your A/B/C, found native losing, and chased why

I built this branch with APCu master and your deepclone, all NTS, JIT off, timing warm fetches where C rebuilds the same isolated object graph B returns (resident dehydrated array plus `deepclone_from_array`). As you said, native lost, and worse as the graph grew. us/op:

```
array of nested ORM entities
  objects   A apcu   B native   C hydrate
     1000     1800        799         501
     2000     4171       1903        1043
object tree
     8191     1582       1736         498
     9841     1928       1836         523
```

Two things you were right about that I had wrong: `deepclone_to_array` / `deepclone_from_array` are generic (no per-class hydrator to charge for), and C hands back the same isolated objects B does. So this was a real loss, not a measurement artifact.

The cause was structural, but not where I first guessed. The warm fetch kept a request-local prototype of the materialized graph and deep-cloned it on every repeat fetch, and for an object graph that clone is slower than decoding the compact SHM layout again. A shared graph never holds shared identity or cycles, so each decode is already an independent copy; the prototype was pure overhead.
On top of that the decoder re-resolved the class (`zend_lookup_class`) for every object, and the builder stored a separate copy of each repeated class and property name.

### (c) The fix

Three changes, all behind the existing API, with no visible behaviour or format change:

- Skip the request-local prototype for shared-graph values and decode from SHM on each fetch. (This also removes the O(n) array walk in (a).)
- Deduplicate equal strings within a payload at build time, so a class or property name repeated across thousands of objects is stored once.
- Memoize the resolved class per (buffer, offset) during a decode, so a homogeneous graph resolves its class once, not once per node.

Same A/B/C after the change, NTS, JIT off, us/op:

```
array of nested ORM entities
  objects   A apcu   B native   C hydrate
     1000     1781        357         492
     2000     3868        721        1036
object tree
     8191     1565        462         485
     9841     1830        499         513
```

Native now beats deepclone on every nested workload I tried: about 1.4x on the 2000-entity array, and the deep trees that lost 3.5x now win. The 400-object case went from 72 to 23 us. The full ext/opcache suite passes, plus new regression tests, on NTS and ZTS.

To make this reproducible on your terms, I added a deepclone backend to my own HTTP benchmark harness (dehydrate with `deepclone_to_array()`, keep the array in the volatile cache, rehydrate with `deepclone_from_array()` on each fetch) and re-ran `vote_read_long` under the published conditions (php-fpm + nginx NTS and FrankenPHP ZTS, 20 iterations / 3 warmup / 3000 ops, JIT off). The APCu baselines match the published table within about 2%, so the runtimes are comparable.
native vs deepclone, mean us/op (NTS):

```
workload                  APCu   native   deepclone
route_table_read         161.2     0.90        0.91   (array: tie)
large_array               90.9     0.88        0.88   (array: tie)
metadata_object_read     185.3     1.12        1.32   (native)
metadata_object_mutate   162.4     1.03        1.19   (native)
safe_direct_object         2.5     1.22        3.03   (native; deepclone slower than APCu)
carbon_datetime_object   185.4     46.0       166.3   (native, ~3.6x)
spl_collection_object     21.0     5.48        1.89   (deepclone)
```

So under the RFC's own methodology native is faster than the deepclone path on every object workload except SPL collections, and ties on arrays. The SPL case is the one real win for deepclone, and it is specific: those classes go through the safe-direct serialized path, whose per-fetch copy handler is heavier than rebuilding from a flat array. I have noted it in the RFC as a concrete follow-up (a tighter SPL copy handler); it does not change the overall picture. The updated tables are in the RFC.

Honest edges remain: for a tiny object deepclone's tight path is a hair faster (sub-microsecond), and for read-only scalar config a resident literal still wins outright, as in (a). But for the workload this feature is actually for, large nested object graphs from a database, in-engine storage is now the faster option.

### (d) Not just performance

This does not rest on performance alone. Object support is also useful for being built in and generic (no third-party extension, nothing to pre-generate) and for being one primitive: the store side and the runtime cross-worker sharing live in the same place, instead of "cache the array" plus "hydrate in userland" wired together by every library. And the safe-direct registry is not a userland protocol: a plain user object with no magic and no cycles or refs takes the fast path automatically via `can_restore_direct()`, and the C-only registry only covers a few internal classes whose state the generic path cannot read. Keeping objects imposes nothing on the ecosystem.
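
The string-deduplication change listed in (c) is easiest to picture with a toy model. The sketch below is userland PHP and purely illustrative: `buildPayload()`, the `User` class and the array layout are hypothetical names of mine, not the engine's SHM format. It only shows the idea that a class or property name repeated across many objects is stored once and referenced by index:

```php
<?php
// Toy model of build-time string deduplication. Illustrative only:
// the real builder writes a compact SHM layout, not PHP arrays.
function buildPayload(array $objects): array
{
    $strings = [];   // index => string, each distinct string stored once
    $index = [];     // string => index, for fast lookup while building

    $intern = function (string $s) use (&$strings, &$index): int {
        if (!isset($index[$s])) {
            $index[$s] = count($strings);
            $strings[] = $s;
        }
        return $index[$s];
    };

    $nodes = [];
    foreach ($objects as $object) {
        $props = [];
        foreach (get_object_vars($object) as $name => $value) {
            $props[$intern($name)] = $value;   // property name by index
        }
        $nodes[] = ['class' => $intern($object::class), 'props' => $props];
    }

    return ['strings' => $strings, 'nodes' => $nodes];
}

final class User
{
    public function __construct(public int $id, public string $name) {}
}

$users = [];
for ($i = 0; $i < 1000; $i++) {
    $users[] = new User($i, "user{$i}");
}

$payload = buildPayload($users);
echo count($payload['nodes']), " nodes, ",
     count($payload['strings']), " distinct strings\n";
// prints "1000 nodes, 3 distinct strings"
```

Without the table, the same three names would be written once per object; the per-decode class memoization in (c) follows the same pattern on the read side.
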
## Dropping pinned (and the attributes)

> PinnedStatic on the Carbon shape is ~1.5 us [...] there's no preload
> trick that reaches that number, because preload can't bake a live object
> graph into an opcode literal

Pinned is the one place a live-object representation still wins clearly, for a reason the volatile numbers above do not capture. Pinned (and `#[PinnedStatic]`) materialize the graph once per worker; after that it is a plain static read on every subsequent request in that worker, near zero per request. The hydration approach pays its hydrate cost on every request instead. Preload cannot reach this either: it can only intern scalar and array literals, not bake a live object graph into an opcode literal.

The caveat is that this holds for read-only / immutable shared state, where keeping one live instance across requests is correct; a mutable shared instance would leak between requests. But that is a real and common case: a compiled DI container, a routing table, config value objects. Your request-registry counter rebuilds per request from the cache, so it does not reach the per-worker amortization, and for the read-only data where it would help, pinned already does it with less per-request cost.

The attributes are the ergonomic surface over that same mechanism, so I would keep them in this RFC rather than split them out. They add no new storage model; they remove the explicit store/fetch boilerplate for the static-state case.

## Where this leaves us

What is already done or committed: the SAPI opt-in model (the `allow_unsafe_runtime` flag and the SAPI allowlist are gone, replaced by the internal opt-in/partition API); the error model; storage-path visibility via `getCacheStoreType()`; dropping the "top-level ref" idea; the config-array fix (skipping the request-local prototype for shared graphs, which removes the per-fetch array walk so a warm scalar-array fetch is zero-copy); and the large-nested object path from (c), with numbers on this same A/B/C.
I am declining `remember()`, for the lock-safety reason above.

On the central question I went where the measurements led. You were right that native lost as shipped; I found why (a request-local prototype clone slower than re-decoding, plus per-object class lookups and duplicated strings), fixed all three, and native now beats your deepclone path on the nested object workloads, with the full opcache suite and new regression tests passing on NTS and ZTS. For tiny objects deepclone is still a hair ahead, and for read-only scalar config a resident literal still wins; I concede both.

So I do think in-engine object storage earns its place now, on performance and on being a built-in, generic, single primitive (and on pinned's per-worker amortization for read-only state). But if the body still prefers a focused better-APCu plus a core hydration primitive, that is an outcome I can support; the capability matters to me more than where it sits, and the work above transfers either way.

The revised branch is pushed and the harness is published, so you can check the numbers directly; I will also post the full before/after A/B/C here. If you have a methodology you would prefer, I will run that too.

Thanks again. This got much sharper because you measured it, and it sent me to a fix I would not have found otherwise.

## Jakub: the FPM pool boundary is preserved

> The FPM shared hosting part is a problem [...] we consider data leaks
> between pools as security issues [...] Maybe the solution would be to
> allow it only if there is one pool enabled.

This is the concern I most wanted to get right, and I think the implementation answers it without the single-pool restriction.

Static Cache is not one cache shared across pools.
FPM creates a separate partition per worker pool in the master, before any worker forks; each partition owns its own volatile and pinned shared-memory backend, and each worker activates only its own pool's partition during child initialization, before user code runs. Every cache API, status call, clear, and the Static Cache part of `opcache_reset()` operates on the active pool's partition. There is no API path from one pool to another pool's data, so the pool boundary stays a security boundary and no policy change is needed. If a pool's partition fails to start it gets no Static Cache; it never falls back to a shared one.

One honest caveat, for the record: the per-pool segments are anonymous shared mappings created in the master before fork, so a worker inherits every pool's segment in its address space even though it can only ever address its own pool's partition. That is the same exposure model as the main OPcache SHM, which is already shared across pools today; the Static Cache is in fact more isolated, because it is logically partitioned per pool where the script cache is not. The data-leak-through-the-feature case you raised, one pool reading another's cached values through the API, does not exist in this design.

If on top of that we want address-space isolation, so a worker cannot even see another pool's bytes, that is a worthwhile hardening (per-pool named segments mapped only in that pool's children, or unmapping the others post-fork), and I am happy to do it as a follow-up if you consider it in scope.

Your single-pool suggestion would also work, but per-pool partitions keep the feature usable for the multi-pool shared-hosting setups where a single-cache design would otherwise be unacceptable.

## Timo: thanks for the immutable_cache pointer

> See also Tyson's php-immutable_cache [...] related APCu discussions

Thank you. Tyson told me about `immutable_cache` himself a while ago, and it shaped my thinking here.
I built an internal extension along the same lines, `colopl_cache`, an APCu-style drop-in for immutable values. What that work showed me is that the parts that matter most for this use case (OPcache compatibility, behaviour under a JIT-heavy workload, and the Zend VM intervention needed for static-state caching) are very hard to get right as an ordinary extension. That is why I brought this to OPcache as an RFC instead of shipping another extension: it needs cooperation from the engine, the VM, and a few internal classes that an extension cannot coordinate cleanly. So the prior art is genuinely appreciated; it is part of how I arrived here.

Best regards,
Go Kudo
