[
https://issues.apache.org/jira/browse/IGNITE-28606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ilya Shishkov updated IGNITE-28606:
-----------------------------------
Epic Link: (was: IGNITE-25881)
> СacheObject.prepareMarshal is not thread-safe — concurrent callers duplicate
> marshalling work
> ---------------------------------------------------------------------------------------------
>
> Key: IGNITE-28606
> URL: https://issues.apache.org/jira/browse/IGNITE-28606
> Project: Ignite
> Issue Type: Task
> Reporter: Alex Abashev
> Assignee: Anton Vinogradov
> Priority: Minor
> Labels: IEP-132, ise
>
> h2. Summary
> {{KeyCacheObjectImpl.prepareMarshal(CacheObjectValueContext)}} and
> {{CacheObjectImpl.prepareMarshal(...)}} use an unsynchronized {{if (valBytes
> == null) valBytes = ctx.marshal(val);}} as their idempotency check. Under
> concurrent access — e.g. when the same {{CacheObject}} instance is referenced
> from multiple transaction entries that are marshalled from different threads
> — two callers can both observe {{valBytes == null}}, both invoke
> {{ctx.marshal(val)}}, and race to publish their result. The work is
> duplicated (CPU + allocations) and the final field value depends on write
> ordering.
> The race is functionally benign today (see below) but it is a real
> maintenance hazard: any future audit, assertion, or instrumentation that
> assumes "marshal happens once per CO" will fire spuriously, and on a more
> aggressive memory allocator the unsynchronized publish could become an actual
> data race rather than just wasted work.
> h2. Where observed
> While prototyping a debug-only assertion that flips a transient flag inside
> {{CacheObjectAdapter}} on each {{prepareMarshal}} call, the assertion fired
> under {{IgniteCacheQueryMultiThreadedSelfTest}} from the following call site:
> {noformat}
> java.lang.AssertionError: prepareMarshal called more than once on
> KeyCacheObjectImpl
> at
> org.apache.ignite.internal.processors.cache.KeyCacheObjectImpl.prepareMarshal(KeyCacheObjectImpl.java:119)
> at
> org.apache.ignite.internal.processors.cache.transactions.IgniteTxEntry.marshal(IgniteTxEntry.java:946)
> at
> org.apache.ignite.internal.processors.cache.GridCacheMessage.marshalTx(GridCacheMessage.java:380)
> at
> org.apache.ignite.internal.processors.cache.distributed.GridDistributedTxPrepareRequest.prepareMarshal(GridDistributedTxPrepareRequest.java:379)
> at
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.onSend(GridCacheIoManager.java:1152)
> at
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.send(GridCacheIoManager.java:1223)
> ...
> {noformat}
> The diagnostic was rolled back precisely because it could not coexist with
> the current racy design. The race itself is the subject of this ticket.
> h2. Root cause
> {code:java}
> // KeyCacheObjectImpl
> @Override public void prepareMarshal(CacheObjectValueContext ctx) throws
> IgniteCheckedException {
> if (valBytes == null) // (1) unsynchronized read
> valBytes = ctx.marshal(val); // (2) unsynchronized write
> }
> {code}
> Two threads executing {{(1)}} concurrently both see {{null}} and both execute
> {{(2)}}. {{valBytes}} is not {{volatile}}, so publication is also unspecified
> — another thread reading the field after the race can observe {{null}} even
> after both writers completed on other cores.
> Same pattern in {{CacheObjectImpl.prepareMarshal}}.
> {{CacheObjectByteArrayImpl.prepareMarshal}} is a no-op and is unaffected.
> h2. Why it is a functional no-op today
> * {{val}} is effectively immutable once the {{CacheObject}} is constructed.
> * {{ctx.marshal(val)}} is deterministic: equal inputs produce equal byte
> arrays.
> * The NIO writer reaches {{valBytes}} via {{writeExternal}} / {{putValue}} /
> {{valueBytesLength}}, each of which calls {{valueBytes(ctx)}} that re-runs
> the idempotency check and will populate {{valBytes}} on the calling thread if
> both racers somehow lost their writes.
> So the race does not corrupt on-wire output in practice. What it does cost:
> * *Wasted CPU.* Duplicate {{Marshaller.marshal}} calls on contended keys (hot
> paths during tx prepare when the same key appears in multiple
> {{IgniteTxEntry}} instances).
> * *Wasted allocation.* Each losing racer produces a {{byte[]}} that's
> immediately garbage.
> * *Debugging friction.* Any future audit or assertion that presumes "marshal
> happens once per CO" cannot be added without first fixing this.
> * *Specification drift.* The contract is effectively "idempotent under
> concurrency by accident" — reliant on value immutability and marshaller
> determinism rather than stated invariants.
> h2. Reproduction
> * Add a transient {{boolean prepareMarshalCalled}} field on
> {{CacheObjectAdapter}} plus a helper that flips it under {{assert}}, then add
> {{assert assertFirstPrepareMarshal();}} as the first statement of
> {{KeyCacheObjectImpl.prepareMarshal}} (inside the {{if (valBytes == null)}}
> branch). Run {{IgniteCacheQueryMultiThreadedSelfTest}} under {{-ea}}; the
> assertion fires on a meaningful fraction of runs at the
> {{IgniteTxEntry.marshal → key.prepareMarshal}} call site.
> * Synthetic stress: share a single {{KeyCacheObjectImpl}} between two
> threads, call {{prepareMarshal}} concurrently, count the number of
> {{ctx.marshal(val)}} invocations via a counting {{CacheObjectValueContext}}
> mock — it exceeds 1 on a measurable fraction of runs.
> h2. Proposed fix
> Synchronize the check-and-set. Two options, in order of preference:
> # *Double-checked locking on a per-instance monitor*, keeping the hot no-op
> branch lock-free:
> {code:java}
> @Override public void prepareMarshal(CacheObjectValueContext ctx) throws
> IgniteCheckedException {
> if (valBytes != null)
> return;
> synchronized (this) {
> if (valBytes == null)
> valBytes = ctx.marshal(val);
> }
> }
> {code}
> Requires making {{valBytes}} {{volatile}} for safe DCL. Uncontended cost: one
> volatile read. Contended: {{synchronized}} on the {{CacheObject}} itself,
> which is already the per-key identity.
> # *Atomic publish via {{VarHandle}} / {{Unsafe.compareAndSetObject}}* on
> {{valBytes}} — winner's bytes publish, loser's bytes are garbage. Avoids
> entering {{synchronized}}, but requires a {{VarHandle}} field and the losing
> thread still does the marshal work.
> Option 1 eliminates both the race and the duplicate work; option 2 eliminates
> the race but preserves duplicate work. Option 1 is recommended.
> Mirror the same treatment in {{CacheObjectImpl.prepareMarshal}} (which calls
> {{valueBytesFromValue(ctx)}} instead of {{ctx.marshal(val)}}).
> h2. Scope
> * {{modules/core}}:
> ** {{KeyCacheObjectImpl.prepareMarshal}}
> ** {{CacheObjectImpl.prepareMarshal}}
> ** {{CacheObjectAdapter.valBytes}} → {{volatile byte[]}}
> * No wire-protocol change (field layout unchanged).
> * No public API change.
> h2. Acceptance criteria
> * Re-add the {{assertFirstPrepareMarshal}} probe described under
> "Reproduction" and confirm it does not fire in
> {{IgniteCacheQueryMultiThreadedSelfTest}} or other multi-threaded suites
> under {{-ea}}.
> * No measurable throughput regression on {{GridCachePutAllFailoverSelfTest}}
> and {{GridCacheRebalancingSyncSelfTest}} (both exercise the same CO marshal
> paths on the hot path).
> * New unit test: two threads race on a fresh
> {{KeyCacheObjectImpl.prepareMarshal}}; {{ctx.marshal(val)}} is invoked
> exactly once (verified via a counting {{CacheObjectValueContext}} mock).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)