[
https://issues.apache.org/jira/browse/DIRMINA-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Lecharny updated DIRMINA-753:
--------------------------------------
Fix Version/s: (was: 2.0.0-RC2)
2.0.0
> Some writes observed to stall until select times out
> ----------------------------------------------------
>
> Key: DIRMINA-753
> URL: https://issues.apache.org/jira/browse/DIRMINA-753
> Project: MINA
> Issue Type: Bug
> Components: Core
> Affects Versions: 2.0.0-RC1
> Environment: java version "1.6.0_16"
> Java(TM) SE Runtime Environment (build 1.6.0_16-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 14.2-b01, mixed mode)
> Reporter: John R. Fallows
> Assignee: Emmanuel Lecharny
> Fix For: 2.0.0
>
>
> We have observed an issue in Mina 2.0.0-RC1 where TCP writes are not flushed
> to the network even though the receiver is keeping up and CPU has low
> utilization.
> In the test scenario, all of the clients initially connect to the server,
> then the server starts writing messages to each of the clients. The clients
> connect via localhost to eliminate clock drift and physical network impact,
> and we measure the observable latency of the message between sender and
> receiver.
> Note that the clients do not send any data to the server after messages start
> flowing from the server, just ack traffic.
> When the problem occurs, some messages are unexpectedly delayed by
> approximately 1000ms.
> By observing a tcpdump of the wire traffic during the test, we verified that
> the receiver was keeping up with message delivery, but that message delivery
> would sometimes stall for approximately 1000ms after which the delayed
> messages would be written to the wire.
> By re-running the same test with a locally modified version of
> AbstractPollingIoProcessor, such that the select() timeout was 750ms
> instead of 1000ms, we found that the messages were delayed by approximately
> 750ms instead.
> Further inspection of AbstractPollingIoProcessor.flush(T session) suggests a
> potential race condition that would explain this behavior:
> public final void flush(T session) {
> boolean needsWakeup = flushingSessions.isEmpty();
> if (scheduleFlush(session) && needsWakeup) {
> wakeup();
> }
> }
> For the purposes of this description, let's say
> AbstractPollingIoProcessor.flush(T session) is called by the [writer]
> thread, while Selector.select(long timeout) is called by the [processor]
> thread.
> [writer] flushingSessions.isEmpty() -> false (last flushing session
> being processed)
> [processor] flush(currentTime), flushingSessions.poll() empties
> flushingSessions, completely write last flushing session, write request queue
> empty, no subsequent flush scheduled
> [writer] scheduleFlush(session) returns true, flushingSessions now
> non-empty, requires wakeup(), but not called (!)
> [processor] stalls until select timeout expires, then flushes
> It seems that the intent of needsWakeup is to avoid calling wakeup()
> unnecessarily by tracking when flushingSessions becomes non-empty, but it
> does not seem to capture the edge case where flushSessions starts out
> non-empty, then becomes empty, then becomes non-empty again, triggering the
> stall.
> We found that the following modification to
> AbstractPollingIoProcessor.flush(T session) addressed the issue without
> adverse effects.
> public final void flush(T session) {
> if (scheduleFlush(session)) {
> wakeup();
> }
> }
> The difference in behavior is as follows:
> 1. wakeup() could be called many times between calls to select()
> 2. select() might unblock one extra time when the selector is otherwise idle
> Case (1) is documented by Selector.select to be equivalent to a single call
> to wakeup() between calls to select(). The assumption is that subsequent
> calls to wakeup() are cheap because it is latched internally.
> Case (2) can occur as follows:
> [writer] scheduleFlush(session) returns true
> [processor] processes all flushingSessions, including most recent addition by
> [writer]
> [writer] calls wakeup()
> [processor] immediately unblocks next call to select() due to wakeup() call
> by [writer]
> [processor] blocks again on next call to select()
> When the [processor] unblocks select() due to wakeup() call by [writer], this
> is not an _extra_ unblock when there is other work to
> do in the [processor] loop, so any additional overhead introduced by
> eliminating the needsWakeup check is only observed when the selector
> is otherwise idle, and is therefore considered harmless.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.