[ 
https://issues.apache.org/jira/browse/SOLR-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719000#comment-16719000
 ] 

Jason Gerlowski edited comment on SOLR-13037 at 12/12/18 1:58 PM:
------------------------------------------------------------------

To (hopefully) explain things a little more clearly, here's the race condition 
I think we're running into here.  There's a few sections of 
{{TestSimGenericDistributedQueue}} that seem to fail, but let's zoom in on one 
in particular.  Check out TestSimDistributedQueue lines 73-74:

{code}
 (new QueueChangerThread(dq,1000)).start();
 assertNotNull(dq.peek(15000));
{code}

This test code has two threads of interest. The QueueChangerThread we see 
created here will sleep for one second, and then insert data into the queue. 
Meanwhile the main test thread will wait for some data to be inserted into the 
queue. Our queue-reading waits a pretty generous amount of time for things to 
enter the queue, so the insert should always finish in time and the read should 
always pick it up.

Some more detail on the operation of each queue operation happens. First the 
queue-write (i.e. {{offer()}}):
 - [Acquire lock 
'multilock'|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L461]
 - [Create queue entry node and attach it to 
parent|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L324]
 - [Wake up any threads sleeping on the 'changed' 
Condition|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L593]
 - [Release lock 
'multilock'|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L465]
 - [Set data for queue 
entry|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L468]

Now the queue-read. Queue-reading works off of a cache of "known queue entries" 
and most queue-reads are handled from there. But the test failure only occurs 
when we need to refresh this cache and read straight from ZK, so I'll skip the 
cache logic here.
 - [Acquire lock 
'updateLock'|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L186]
 - [loop until we're out of time to 
wait:|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L189]
 ** [look for an element and return if 
non-null|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L190]
 ** [sleep until we receive a wakeup  from 'changed' Condition or we time 
out.|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L194]
 - [Release lock 
'updateLock'.|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L198]

There's a problem with the queue-write code above.  We wake up threads after 
creating the queue-entry, but before it's fully initialized with its data.  
This opens the door to readers seeing the data before it's fully ready and 
going back to sleep.  The 'changed' signalling has already happened, so any 
readers that see the data too early will go back to sleep and not wake up again 
until timeout.

There's a few ways we can fix this:
- we could add a `changed.signalAll()` call at the end of {{offer()}}, to 
ensure that there's at least 1 wakeup after the data has been fully added.
- we can alter the flow of SimDistribStateManager.createData so that the node 
is only attached to the tree after its data has been fully initialized
- we could register a Watcher that triggers on "data-changed", similar to how 
we already trigger a watcher on "child-added"  
 
I think the second option is probably the "right" fix here, so I'll pursue that 
unless others have other opinions.


was (Author: gerlowskija):
To (hopefully) explain things a little more clearly, here's the race condition 
I think we're running into here.  There's a few sections of 
{{TestSimGenericDistributedQueue}} that seem to fail, but let's zoom in on one 
in particular.  Check out TestSimDistributedQueue lines 73-74:

{code}
 (new QueueChangerThread(dq,1000)).start();
 assertNotNull(dq.peek(15000));
{code}

This test code has two threads of interest. The QueueChangerThread we see 
created here will sleep for one second, and then insert data into the queue. 
Meanwhile the main test thread will wait for some data to be inserted into the 
queue. Our queue-reading waits a pretty generous amount of time for things to 
enter the queue, so the insert should always finish in time and the read should 
always pick it up.

Some more detail on the operation of each queue operation happens. First the 
queue-write (i.e. {{offer()}}):
 - [Acquire lock 
'multilock'|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L461]
 - [Create queue entry node and attach it to 
parent|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L324]
 - [Wake up any threads sleeping on the 'changed' 
Condition|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L593]
 - [Release lock 
'multilock'|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L465]
 - [Set data for queue 
entry|https://github.com/apache/lucene-solr/blob/18356de83738d64e619898016d873993ec474d17/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimDistribStateManager.java#L468]

Now the queue-read. Queue-reading works off of a cache of "known queue entries" 
and most queue-reads are handled from there. But the test failure only occurs 
when we need to refresh this cache and read straight from ZK, so I'll skip the 
cache logic here.
 - [Acquire lock 
'updateLock'|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L186]
 - [loop until we're out of time to 
wait:|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L189]
 ** [look for an element and return if 
non-null|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L190]
 ** [sleep until we receive a wakeup  from 'changed' Condition or we time 
out.|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L194]
 - [Release lock 
'updateLock'.|https://github.com/apache/lucene-solr/blob/8cde1277ec7151bd6ab62950ac93cbdd6ff04d9f/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/GenericDistributedQueue.java#L198]

There's a problem with the queue-write code above.  We wake up threads after 
creating the queue-entry, but before it's fully initialized with its data.  
This opens the door to readers seeing the data before it's fully ready and 
going back to sleep.  The 'changed' signalling has already happened, so any 
readers that see the data too early will go back to sleep and not wake up again 
until timeout.

There's a few ways we can fix this:
- we could add a `changed.signalAll()` call at the end of {{offer()}}, to 
ensure that there's at least 1 wakeup after the data has been fully added.
- we can alter the flow of SimDistribStateManager.createData so that the node 
is only attached to the tree after its data has been fully initialized
- we could register a Watcher that triggers on "data-changed", similar to how 
we already trigger a watcher on "child-added"  
 

> Harden TestSimGenericDistributedQueue.
> --------------------------------------
>
>                 Key: SOLR-13037
>                 URL: https://issues.apache.org/jira/browse/SOLR-13037
>             Project: Solr
>          Issue Type: Sub-task
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Tests
>            Reporter: Mark Miller
>            Assignee: Jason Gerlowski
>            Priority: Major
>         Attachments: repro-log.txt
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to