[
https://issues.apache.org/jira/browse/IGNITE-28613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18078206#comment-18078206
]
Sergey Chugunov commented on IGNITE-28613:
------------------------------------------
[~abashev], everything looks good to me, I merged your PR to master. Thank you
for contribution! Could you please resolve the ticket?
> GridNioSelfTest#testAsyncSendReceive: spurious "Invalid message size" in
> surefire log from listener ack pingpong
> ----------------------------------------------------------------------------------------------------------------
>
> Key: IGNITE-28613
> URL: https://issues.apache.org/jira/browse/IGNITE-28613
> Project: Ignite
> Issue Type: Task
> Reporter: Alex Abashev
> Assignee: Alex Abashev
> Priority: Minor
> Labels: IEP-132, ise
> Fix For: 2.19
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> h2. Summary
> {{GridNioSelfTest#testAsyncSendReceive}} logs an {{IgniteCheckedException:
> Invalid message size: -269488145}} (= {{0xEFEFEFEF}}) on every run on master.
> The JUnit assertion still passes (the latch fires before the exception), so
> the test doesn't go red — but the stack trace pollutes the surefire log and
> breaks any "no errors during test" CI guard.
> The original ticket attributed this to a use-after-free race between
> {{GridNioServer.close(...)}} calling {{cleanDirectBuffer}} and a concurrent
> {{processRead}}. After investigation that RCA is incorrect — see "Why the
> original RCA is wrong" below. The actual cause is a test bug introduced by
> IGNITE-23040.
> h2. Symptom
> Reproducible 100% on master on macOS aarch64:
> {noformat}
> [ERROR][grid-nio-worker-0-#42%nio-test-grid%][] Closing NIO session because
> of unhandled exception.
> class org.apache.ignite.IgniteCheckedException: Invalid message size:
> -269488145
> at
> org.apache.ignite.internal.util.nio.GridNioServerBuffer.read(GridNioServerBuffer.java:82)
> at
> org.apache.ignite.internal.util.nio.GridBufferedParser.decode(GridBufferedParser.java:69)
> at
> org.apache.ignite.internal.util.nio.GridBufferedParser.decode(GridBufferedParser.java:36)
> at
> org.apache.ignite.internal.util.nio.GridNioCodecFilter.onMessageReceived(GridNioCodecFilter.java:114)
> ...
> at
> org.apache.ignite.internal.util.nio.GridNioServer$ByteBufferNioClientWorker.processRead(GridNioServer.java:1235)
> ...
> {noformat}
> 20/20 runs hit it on a recent Apple Silicon JDK 11 build.
> h2. Root cause
> The {{NioListener}} shared by both {{GridNioServer}} peers in this test sends
> an ack on every received message:
> {code:java}
> //sending ack of processing message
> ses.send(new byte[] {(byte)0xDEADBEEF});
> {code}
> {{(byte)0xDEADBEEF}} truncates to {{0xEF}} — the ack is a *one-byte* message
> containing the single byte {{0xEF}}.
> The test's {{BufferedParser}} overrides {{encode}} to skip the size prefix:
> {code:java}
> @Override public ByteBuffer encode(GridNioSession ses, Object msg) {
> // IO manager creates array ready to send.
> return msg instanceof byte[] ? ByteBuffer.wrap((byte[])msg) :
> (ByteBuffer)msg;
> }
> {code}
> So the ack hits the wire as a *single raw byte* {{0xEF}} — no 4-byte size
> header.
> The same {{NioListener}} is wired to both {{srvr1}} and {{srvr2}}, so every
> received message produces an ack going back the other way → ping-pong. Once
> four ack bytes accumulate on either side's read buffer, the receiver's
> {{GridBufferedParser.decode}} reads them as the next message's 4-byte size
> header:
> {noformat}
> msgSize = 0xEF << 24 | 0xEF << 16 | 0xEF << 8 | 0xEF
> = 0xEFEFEFEF
> = -269488145 (signed)
> {noformat}
> {{GridNioServerBuffer.read}} then throws {{Invalid message size: -269488145}}.
> The 10 originals (each properly size-prefixed by {{createMessageWithSize()}})
> arrive and count down the latch *before* enough ack bytes accumulate, which
> is why the assertion still passes.
> h2. Why it surfaces now
> Introduced by IGNITE-23040 "Fix GridNioSslSelfTest" ([PR
> #11505|https://github.com/apache/ignite/pull/11505]), commit {{8782b4882ff}},
> 24 Sep 2024. That commit added the ack send to fix the *SSL* self-test (where
> the peer is a {{TestClient}} that explicitly calls
> {{client.receiveMessage()}} to drain the ack), but it modified the *shared*
> {{NioListener}} rather than introducing a dedicated one. For
> {{testAsyncSendReceive}}, where both peers run a {{GridNioServer}} with the
> size-prefix-skipping {{BufferedParser}}, the protocol mismatch produces the
> spurious error.
> Master has shipped this log noise on every run for ~1.5 years. The assertion
> never fires because the test's contract (10 originals received) is satisfied
> before the parser blows up on the malformed ack stream.
> h2. Why the original RCA is wrong
> The original ticket attributed the symptom to:
> * {{GridNioServer.close(...)}} calling
> {{cleanDirectBuffer(ses.readBuffer())}} *before* cancelling {{ses.key()}},
> while a worker is still inside {{processRead(...)}} on that session.
> Two reasons this can't be the cause in this test:
> # {{testAsyncSendReceive}} doesn't set {{.directMode(true)}} on the
> {{GridNioServer.Builder}}. When {{directMode=false}}, {{createSession}}
> leaves {{ses.writeBuffer()}}/{{ses.readBuffer()}} {{null}}. The whole
> {{cleanDirectBuffer}} block in {{close(...)}} is gated by null-checks — it
> never runs for this test.
> # Even with {{directMode=true}}, {{close(...)}} and {{processRead(...)}}
> cannot run concurrently for the same session. Every {{close}} path goes
> through the owning worker thread:
> #* Public {{GridNioServer.close(ses)}} — {{offerStateChange(CLOSE)}} →
> drained inside the worker's {{bodyInternal}} loop.
> #* {{closeFromWorkerThread(ses)}} — only caller is
> {{InboundConnectionHandler.onMessage}}, itself running on the worker.
> #* {{processRead}}/{{processWrite}} exception handlers — same worker thread.
> #* {{processSelectedKeysOptimized}} catch — same worker thread.
> #* {{srvr.stop()}} — {{U.cancel(clientWorkers)}} → {{U.join(...)}}; the
> per-worker {{finally}} in {{bodyInternal}} performs the closes.
> The worker is single-threaded, and {{processSelectedKeysOptimized}} skips
> keys with {{!key.isValid()}} — which becomes {{false}} immediately after
> {{key.cancel()}}. The race window described in the original RCA does not
> exist.
> h2. Fix
> Stop the ack from polluting the wire in {{testAsyncSendReceive}} by giving it
> its own listener that doesn't send acks. The default {{NioListener}} ack
> stays where it's meaningful ({{testSendReceive}},
> {{testMultiThreadedSendReceive}}, the SSL tests — all of which use
> {{TestClient}} with matching {{client.receiveMessage()}} reads).
> {code:java}
> // In GridNioSelfTest.java
> // Make NioListener fields visible to the subclass:
> private static class NioListener extends GridNioServerListenerAdapter<byte[]>
> {
> protected final AtomicInteger msgCnt = new AtomicInteger(0);
> protected final AtomicBoolean sizeFailed = new AtomicBoolean(false);
> protected final CountDownLatch latch;
> ...
> }
> // New listener for testAsyncSendReceive:
> private static class NoAckNioListener extends NioListener {
> NoAckNioListener(CountDownLatch latch) { super(latch); }
> @Override public void onMessage(GridNioSession ses, byte[] data) {
> int expMsgSize = getExpectedMessageSize();
> if (data == null || (expMsgSize != 0 && data.length != expMsgSize))
> sizeFailed.set(true);
> msgCnt.incrementAndGet();
> if (latch != null)
> latch.countDown();
> }
> }
> // In testAsyncSendReceive:
> NioListener lsnr = new NoAckNioListener(latch);
> {code}
> h2. Acceptance criteria
> * 20 consecutive runs of {{GridNioSelfTest#testAsyncSendReceive}} produce no
> {{Invalid message size}} entries in the surefire log (currently 20/20 hit it
> on macOS aarch64).
> * {{testSendReceive}}, {{testMultiThreadedSendReceive}} and the SSL
> self-tests stay green — they keep using the ack-sending {{NioListener}}.
> h2. Out of scope
> * No production code changes. The original ticket's proposed change to
> {{GridNioServer.close(...)}} (move {{closeKey}} before {{cleanDirectBuffer}})
> is unnecessary because the race it targets cannot happen — see "Why the
> original RCA is wrong".
> * No changes to {{BufferedParser}} / {{GridBufferedParser}} /
> {{GridNioServerBuffer}} — the protocol mismatch is purely a test-side issue.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)