[
https://issues.apache.org/jira/browse/DIRMINA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16842072#comment-16842072
]
Emmanuel Lecharny commented on DIRMINA-1107:
--------------------------------------------
Forget about leveraging the {{event}} method: as the event is initiated in the
{{SslFilter}}, it won't be available in this very filter...
The big issue I have at the moment is how to ensure that the messages we have
enqueued are flushed when, and only when, the handshake has completed, and how
to guarantee that no new message gets enqueued while we are flushing them. This
typically happens when we start flushing the existing pending messages while
some other thread is writing a new message: that thread must wait until all the
existing messages in the queue have been written. We also don't want to keep
this queue forever: once the handshake has completed and the queue has been
flushed, it is no longer useful.
Actually, we need two barriers:
* one for messages that are produced *before* the handshake has started but
*after* the {{SslFilter}} has been added to the chain (this only matters when we
inject the filter into an active session, like when using startTLS)
* one for messages that are produced during the handshake

Between those two barriers, all messages are enqueued; after those two barriers,
no message is enqueued. And the tricky part: while the queue is being flushed,
no message should be added to the queue, but new messages must not be written
directly either (to preserve the message order):
{noformat}
--------[entry barrier]-------------------[exit barrier]-------------------[queue flushed]------------
        |<---- messages are enqueued ---->|<---- messages are waiting ---->|
{noformat}
Now to find a way to code that...
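One possible shape for that code, just as a rough sketch: a single lock guards
both the pending-messages queue and a flag that tells whether the queue is still
in use. Writes are enqueued while the flag is up, the handshake-complete path
drains the queue while still holding the lock and then drops the flag, so a
concurrent writer can neither overtake the pending messages nor slip a new
message into a queue that is about to be discarded. All names below
({{pendingWrites}}, {{filterWrite()}}, {{handshakeComplete()}}, {{doWrite()}})
are placeholders, not actual {{SslHandler}} members:
{code:java}
// Rough sketch only - not SslHandler code; names and structure are illustrative.
private final Object lock = new Object();
private final Deque<WriteRequest> pendingWrites = new ArrayDeque<>();
private boolean queueing = true;   // raised at the entry barrier (filter added to the chain)

void filterWrite(WriteRequest request) {
    synchronized (lock) {
        if (queueing) {
            // Between the entry barrier and the end of the flush: hold the message back.
            pendingWrites.add(request);
            return;
        }
    }
    // Queue already flushed and discarded: write straight through.
    doWrite(request);
}

void handshakeComplete() {         // exit barrier
    synchronized (lock) {
        // Drain while still holding the lock: a concurrent filterWrite() either landed
        // in the queue before we got here (and is flushed now, in order) or waits on
        // the lock until the flush is over, so it cannot overtake a pending message.
        WriteRequest next;
        while ((next = pendingWrites.poll()) != null) {
            doWrite(next);
        }
        queueing = false;          // the queue is not useful anymore
    }
}
{code}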
> SslHandler flushScheduledEvents race condition, redux
> -----------------------------------------------------
>
> Key: DIRMINA-1107
> URL: https://issues.apache.org/jira/browse/DIRMINA-1107
> Project: MINA
> Issue Type: Bug
> Affects Versions: 2.1.2
> Reporter: Guus der Kinderen
> Priority: Major
> Fix For: 2.1.3
>
>
> DIRMINA-1019 addresses a race condition in SslHandler, but unintentionally
> replaces it with another multithreading issue.
> The fix for DIRMINA-1019 introduces a counter that contains the number of
> events to be processed. A simplified version of the code is included below.
> {code:java}
> private final AtomicInteger scheduledEvents = new AtomicInteger(0);
>
> void flushScheduledEvents() {
>     scheduledEvents.incrementAndGet();
>
>     if (sslLock.tryLock()) {
>         try {
>             Object event;
>             do {
>                 // Drain the pending filterWrite events first...
>                 while ((event = filterWriteEventQueue.poll()) != null) {
>                     // ...
>                 }
>
>                 // ...then the pending messageReceived events.
>                 while ((event = messageReceivedEventQueue.poll()) != null) {
>                     // ...
>                 }
>             } while (scheduledEvents.decrementAndGet() > 0);
>         } finally {
>             sslLock.unlock();
>         }
>     }
> }
> {code}
> We have observed occasions where the value of {{scheduledEvents}} becomes
> negative, while at the same time events in {{filterWriteEventQueue}} go
> unprocessed.
> We suspect that this is triggered by the first thread decrementing the counter
> after a second thread has incremented it, but before that second thread has
> attempted to acquire the lock.
> This allows the first thread to empty the queues, decrement the counter to
> zero and release the lock, after which the second thread acquires the lock
> successfully. The second thread then processes any elements in
> {{filterWriteEventQueue}}, followed by any elements in
> {{messageReceivedEventQueue}}. If, in between these two checks, yet another
> thread adds a new element to {{filterWriteEventQueue}}, that element can go
> unprocessed (the second thread does not loop again, since the counter is zero
> or negative, and the third thread can fail to acquire the lock).
> It's a seemingly unlikely scenario, but we are observing the behavior when
> our systems are under high load.
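> For illustration, the suspected interleaving can be replayed deterministically
> by executing the three threads' steps one after the other against the same
> data structures; the thread labels, event names and counter values below are
> illustrative only, not taken from an actual trace:
> {code:java}
> import java.util.concurrent.ConcurrentLinkedQueue;
> import java.util.concurrent.atomic.AtomicInteger;
> import java.util.concurrent.locks.ReentrantLock;
>
> // Sequential replay of the suspected interleaving (T1, T2, T3 are the threads
> // described above); at the end, T3's event is stranded in the queue.
> public class FlushRaceReplay {
>     public static void main(String[] args) {
>         AtomicInteger scheduledEvents = new AtomicInteger(0);
>         ReentrantLock sslLock = new ReentrantLock();
>         ConcurrentLinkedQueue<String> filterWriteEventQueue = new ConcurrentLinkedQueue<>();
>         ConcurrentLinkedQueue<String> messageReceivedEventQueue = new ConcurrentLinkedQueue<>();
>
>         // T1 has incremented the counter and holds the lock, inside its do/while.
>         scheduledEvents.incrementAndGet();          // counter = 1
>         sslLock.lock();
>
>         // T2 schedules an event and increments, but has not reached tryLock() yet.
>         filterWriteEventQueue.offer("T2-event");
>         scheduledEvents.incrementAndGet();          // counter = 2
>
>         // T1 loops once more, drains T2's event, decrements twice, exits, unlocks.
>         while (filterWriteEventQueue.poll() != null) { /* processed by T1 */ }
>         scheduledEvents.decrementAndGet();          // 2 -> 1, loop again
>         scheduledEvents.decrementAndGet();          // 1 -> 0, leave the loop
>         sslLock.unlock();
>
>         // T2 now acquires the lock and drains filterWriteEventQueue (already empty).
>         sslLock.tryLock();                          // succeeds
>         while (filterWriteEventQueue.poll() != null) { /* nothing left */ }
>
>         // T3 schedules a new event and increments, but its tryLock() would fail:
>         // T2 still holds the lock.
>         filterWriteEventQueue.offer("T3-event");
>         scheduledEvents.incrementAndGet();          // 0 -> 1
>
>         // T2 drains messageReceivedEventQueue, decrements, stops looping, unlocks.
>         while (messageReceivedEventQueue.poll() != null) { /* nothing */ }
>         scheduledEvents.decrementAndGet();          // 1 -> 0, no further iteration
>         sslLock.unlock();
>
>         // T3's event was never processed.
>         System.out.println("stranded: " + filterWriteEventQueue); // prints [T3-event]
>     }
> }
> {code}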
> We've applied a code change after which this problem is no longer observed.
> We've removed the counter, and check on the size of the queues instead:
> {code:java}
> void flushScheduledEvents() {
>     if (sslLock.tryLock()) {
>         try {
>             Object event;
>             do {
>                 while ((event = filterWriteEventQueue.poll()) != null) {
>                     // ...
>                 }
>
>                 while ((event = messageReceivedEventQueue.poll()) != null) {
>                     // ...
>                 }
>             } while (!filterWriteEventQueue.isEmpty() || !messageReceivedEventQueue.isEmpty());
>         } finally {
>             sslLock.unlock();
>         }
>     }
> }
> {code}
> The code change illustrated above does, however, introduce a new potential
> problem. Theoretically, an event could be added to one of the queues and
> {{flushScheduledEvents}} called, with {{sslLock.tryLock()}} returning
> {{false}}, exactly after another thread has finished the {{while}} loop but
> before it has released the lock. This would again cause events to go
> unprocessed.
> We have not observed this problem in the wild yet, but we are uncomfortable
> applying this change as-is.
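> For what it is worth, one common pattern that closes this particular window is
> to re-check the queues after releasing the lock and loop until they are empty:
> a thread that fails {{tryLock()}} can then rely on the lock owner re-checking
> the queues after its unlock. The sketch below is only an illustration against
> the simplified code above, not a tested fix:
> {code:java}
> // Sketch only - same (simplified) fields as above.
> void flushScheduledEvents() {
>     while (!filterWriteEventQueue.isEmpty() || !messageReceivedEventQueue.isEmpty()) {
>         if (!sslLock.tryLock()) {
>             // Another thread is flushing; after it unlocks it re-evaluates the
>             // loop condition above, so the event we just enqueued is not lost.
>             return;
>         }
>         try {
>             Object event;
>             while ((event = filterWriteEventQueue.poll()) != null) {
>                 // ...
>             }
>
>             while ((event = messageReceivedEventQueue.poll()) != null) {
>                 // ...
>             }
>         } finally {
>             sslLock.unlock();
>         }
>         // Loop again: anything enqueued while we held (or were releasing) the
>         // lock is picked up by this re-check, which runs after the unlock.
>     }
> }
> {code}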
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)