[ 
https://issues.apache.org/jira/browse/SOLR-14861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erick Erickson updated SOLR-14861:
----------------------------------
    Description: 
Noble and I are trying to get to the bottom of the TestBulkSchemaConcurrent 
failures and found what looks like a glaring gap in how CoreContainer.shutdown() 
operates. I don't know the impact on production, since we're shutting down 
anyway, but I think this is responsible for the errors in 
TestBulkSchemaConcurrent and likely behind other intermittent failures that 
involve core reloads, especially any tests that exercise the managed schema.

We have clear evidence of this sequence:

1> Some CoreContainer.reload() calls come in and get _partway_ through; in 
particular, past the check at the top of CoreContainer.reload() that throws an 
AlreadyClosed exception if isShutdown is set.

2> Some CoreContainer.shutdown() threads get some processing time before the 
reloads in <1> are finished.

3> The threads in <1> pick back up and go wonky. I suspect a number of 
different things can go wrong here, surfacing in different ways depending on 
how far CoreContainer.shutdown() has gotten (see the sketch just below).

Since it's my shift (Noble has to sleep sometime), I put some crude locking in 
just to test the idea: incrementing an AtomicInteger on entry to 
CoreContainer.reload(), decrementing it at the end, and spinning in 
CoreContainer.shutdown() until the counter was back to zero. With that in 
place, 100 runs produced no errors, whereas before I could never get even 10 
runs to finish without an error. This is not a proper fix at all; as it 
currently stands there are still possible race conditions, just with much 
smaller windows, and I suspect it risks spinning forever. But it's enough to 
make me believe I finally understand what's happening.
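
For reference, the crude guard was roughly the following. This is a sketch, 
not the exact diff; the class, field, and method placement are approximations:

{code:java}
import java.util.concurrent.atomic.AtomicInteger;

// Rough sketch of the crude guard described above; not the exact change.
public class CoreContainerGuardSketch {
  private final AtomicInteger inFlightReloads = new AtomicInteger();

  public void reload(String coreName) {
    inFlightReloads.incrementAndGet();
    try {
      // ... existing reload logic ...
    } finally {
      inFlightReloads.decrementAndGet();
    }
  }

  public void shutdown() {
    // Spin until every in-flight reload has finished. This risks spinning
    // forever if a reload hangs, and other operations aren't covered at all,
    // which is why this is a diagnostic hack rather than a fix.
    while (inFlightReloads.get() > 0) {
      Thread.onSpinWait();
    }
    // ... existing shutdown logic ...
  }
}
{code}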

I also suspect that reload is more sensitive than most operations on a core 
because it runs for a long time, but I assume other operations have the same 
potential. Shouldn't CoreContainer.shutdown() wait until no other operations 
are in flight?

On a quick scan of CoreContainer, there are actually few places where we even 
check isShutdown; I suspect the places we do check are ad hoc additions found 
by trial and error when tests fail. We need a design rather than hit-or-miss 
hacking.

I think that isShutdown should be replaced with something more robust. What 
that is, I don't know quite yet, because I've been hammering at this long 
enough and I need a break.
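
One possible shape for something more robust (purely a sketch of an idea, not 
a worked-out proposal, and all names here are invented) is a read/write lock: 
each container-level operation holds the read lock, and shutdown takes the 
write lock, so it blocks until in-flight operations have drained instead of 
racing them:

{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch only: one way the bare isShutdown flag could be replaced.
public class ShutdownAwareGuard {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private volatile boolean shuttingDown = false;

  // Wrap reload/create/unload/etc. in this so they can't start once
  // shutdown has begun, and shutdown can't start while they're running.
  public void runOperation(Runnable op) {
    lock.readLock().lock();
    try {
      if (shuttingDown) {
        throw new IllegalStateException("CoreContainer is shutting down");
      }
      op.run();
    } finally {
      lock.readLock().unlock();
    }
  }

  // Blocks until every in-flight operation has released the read lock.
  public void shutdown() {
    lock.writeLock().lock();
    try {
      shuttingDown = true;
      // ... actual shutdown work ...
    } finally {
      lock.writeLock().unlock();
    }
  }
}
{code}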

This is consistent with another observation about this particular test: with a 
sleep at the end, it doesn't fail, because all the reloads get a chance to 
finish before anything is shut down.

An open question is how much this matters to production systems. In the test, 
a bunch of these reloads are issued and then we immediately end the test and 
start shutting things down. It needs to be fixed if we're going to cut down on 
test failures, though. Besides, it's just wrong ;)

Assigning to myself to track. I'd be perfectly happy, now that Noble and I have 
done the hard work, for someone to swoop in and take the credit for fixing it ;)

gradlew beast -Ptests.dups=10 --tests TestBulkSchemaConcurrent

always fails for me on current code without my hack...

> CoreContainer shutdown needs to be aware of other ongoing operations and wait 
> until they're complete
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-14861
>                 URL: https://issues.apache.org/jira/browse/SOLR-14861
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Major
>



