[
https://issues.apache.org/jira/browse/SOLR-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719000#comment-16719000
]
Jason Gerlowski edited comment on SOLR-13037 at 12/12/18 1:57 PM:
------------------------------------------------------------------
To (hopefully) explain things a little more clearly, here's the race condition
I think we're running into here. There's a few sections of
{{TestSimGenericDistributedQueue}} that seem to fail, but let's zoom in on one
in particular. Check out TestSimDistributedQueue lines 73-74:
{code}
(new QueueChangerThread(dq,1000)).start();
assertNotNull(dq.peek(15000));
{code}
This test code has two threads of interest. The QueueChangerThread we see
created here will sleep for one second, and then insert data into the queue.
Meanwhile the main test thread will wait for some data to be inserted into the
queue. Our queue-reading waits a pretty generous amount of time for things to
enter the queue, so the insert should always finish in time and the read should
always pick it up.
Some more detail on the operation of each queue operation happens. First the
queue-write (i.e. {{offer()}}):
- [Acquire lock
'multilock'|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L461]
- [Create queue entry node and attach it to
parent|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L324]
- [Wake up any threads sleeping on the 'changed'
Condition|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L593]
- [Release lock
'multilock'|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L465]
- [Set data for queue
entry|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L468]
Now the queue-read. Queue-reading works off of a cache of "known queue entries"
and most queue-reads are handled from there. But the test failure only occurs
when we need to refresh this cache and read straight from ZK, so I'll skip the
cache logic here.
- [Acquire lock
'updateLock'|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L186]
- [loop until we're out of time to
wait:|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L189]
** [look for an element and return if
non-null|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L190]
** [sleep until we receive a wakeup from 'changed' Condition or we time
out.|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L194]
- [Release lock
'updateLock'.|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L198]
There's a problem with the queue-write code above. We wake up threads after
creating the queue-entry, but before it's fully initialized with its data.
This opens the door to readers seeing the data before it's fully ready and
going back to sleep. The 'changed' signalling has already happened, so any
readers that see the data too early will go back to sleep and not wake up again
until timeout.
There's a few ways we can fix this:
- we could add a `changed.signalAll()` call at the end of {{offer()}}, to
ensure that there's at least 1 wakeup after the data has been fully added.
- we can alter the flow of SimDistribStateManager.createData so that the node
is only attached to the tree after its data has been fully initialized
- we could register a Watcher that triggers on "data-changed", similar to how
we already trigger a watcher on "child-added"
was (Author: gerlowskija):
To (hopefully) explain things a little more clearly, here's the race condition
I think we're running into here. There's a few sections of
{{TestSimGenericDistributedQueue}} that seem to fail, but let's zoom in on one
in particular. Check out TestSimDistributedQueue lines 73-74:
(code}
(new QueueChangerThread(dq,1000)).start();
assertNotNull(dq.peek(15000));
{code}
This test code has two threads of interest. The QueueChangerThread we see
created here will sleep for one second, and then insert data into the queue.
Meanwhile the main test thread will wait for some data to be inserted into the
queue. Our queue-reading waits a pretty generous amount of time for things to
enter the queue, so the insert should always finish in time and the read should
always pick it up.
Some more detail on the operation of each queue operation happens. First the
queue-write (i.e. {{offer()}}):
- [Acquire lock
'multilock'|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L461]
- [Create queue entry node and attach it to
parent|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L324]
- [Wake up any threads sleeping on the 'changed'
Condition|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L593]
- [Release lock
'multilock'|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L465]
- [Set data for queue
entry|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L468]
Now the queue-read. Queue-reading works off of a cache of "known queue entries"
and most queue-reads are handled from there. But the test failure only occurs
when we need to refresh this cache and read straight from ZK, so I'll skip the
cache logic here.
- [Acquire lock
'updateLock'|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L186]
- [loop until we're out of time to
wait:|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L189]
** [look for an element and return if
non-null|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L190]
** [sleep until we receive a wakeup from 'changed' Condition or we time
out.|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L194]
- [Release lock
'updateLock'.|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L198]
There's a problem with the queue-write code above. We wake up threads after
creating the queue-entry, but before it's fully initialized with its data.
This opens the door to readers seeing the data before it's fully ready and
going back to sleep. The 'changed' signalling has already happened, so any
readers that see the data too early will go back to sleep and not wake up again
until timeout.
There's a few ways we can fix this:
- we could add a `changed.signalAll()` call at the end of {{offer()}}, to
ensure that there's at least 1 wakeup after the data has been fully added.
- we can alter the flow of SimDistribStateManager.createData so that the node
is only attached to the tree after its data has been fully initialized
- we could register a Watcher that triggers on "data-changed", similar to how
we already trigger a watcher on "child-added"
> Harden TestSimGenericDistributedQueue.
> --------------------------------------
>
> Key: SOLR-13037
> URL: https://issues.apache.org/jira/browse/SOLR-13037
> Project: Solr
> Issue Type: Sub-task
> Security Level: Public(Default Security Level. Issues are Public)
> Components: Tests
> Reporter: Mark Miller
> Assignee: Jason Gerlowski
> Priority: Major
> Attachments: repro-log.txt
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]