Re: Why MultiThreadedOCPTest (sometimes) fails

2020-05-30 Thread Erick Erickson
Didn’t mean to imply that the test _should_ be serialized; rather that I’ve
inadvertently done so in the past when it shouldn’t have been…

Have fun!

> On May 30, 2020, at 6:23 PM, Ilan Ginzburg  wrote:





Re: Why MultiThreadedOCPTest (sometimes) fails

2020-05-30 Thread Ilan Ginzburg
Erick,

Serializing isn't really an option here; we want to test that execution
order is *not* submission order...

I believe that if we wanted to verify the property this test seems to
assert ("when there are more tasks than the number of executor threads and
all are blocked on a single lock, then a task enqueued afterwards that
requires a different, available lock will be run right away"), a unit test
of OverseerTaskProcessor/OverseerCollectionConfigSetProcessor would be more
appropriate. But doing so would require refactoring: the run() method of
OverseerTaskProcessor is 200 lines long.
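
To make that property concrete before any refactoring: here is a plain-JDK
sketch (all names made up, a Semaphore standing in for the busy collection
lock, nothing to do with the actual Overseer code). A naive fixed pool fails
the property, since once every thread is blocked on the busy lock the later
task starves; that starvation is exactly what the test wants to prove the
Overseer avoids.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class LockAwareDispatchSketch {
    public static void main(String[] args) throws InterruptedException {
        final int executors = 4; // stand-in for the 100 Collection API executors
        ExecutorService pool = Executors.newFixedThreadPool(executors);

        Semaphore lockA = new Semaphore(0); // the busy "COLL_A" lock: never available
        CountDownLatch taskB = new CountDownLatch(1);

        // More tasks than executor threads, all blocked on the same lock.
        for (int i = 0; i < executors + 2; i++) {
            pool.submit(lockA::acquireUninterruptibly);
        }

        // A task enqueued afterwards, needing a different (available) lock...
        pool.submit(taskB::countDown);

        // ...starves here, because a naive pool dispatches blindly. The property
        // under test is that the Overseer does better: skip the tasks whose lock
        // is busy and run this one right away.
        boolean ranRightAway = taskB.await(2, TimeUnit.SECONDS);
        System.out.println("later task ran right away: " + ranRightAway); // false

        lockA.release(executors + 2); // unblock everything and shut down cleanly
        pool.shutdown();
    }
}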

I'll likely go with an "overkill" option. It still won't be free of timing
issues, but it should be less vulnerable to them. I'll also try to detect
when the test preconditions do not hold, so the test can fail with an
informative message such as "slow task finished execution too soon, cannot
test running tasks in parallel".

Will create a new Jira.

Thanks for your weekend feedback,
Ilan

On Sat, May 30, 2020 at 11:26 PM Erick Erickson wrote:



Re: Why MultiThreadedOCPTest (sometimes) fails

2020-05-30 Thread Erick Erickson
Ilan:

Please raise a new JIRA and attach any fixes there, as a Git PR or a patch,
whichever you prefer. Your analysis will get lost in the noise of SOLR-12801.
And thanks for digging! We usually title these something like “harden
MultiThreadedOCPTest” or “fix failing MultiThreadedOCPTest” or similar.

I haven’t really looked at the code you’re analyzing; it’s the weekend after
all ;). So ignore this question if it makes no sense. Anyway, I’m always
suspicious of using delays for just this kind of reason: depending on the
hardware, something may fail sometimes. Is there any way to make the
sequencing more of a positive lock? Unfortunately that can be tricky; I’ve
wound up doing things like serializing all tasks, which…er…isn’t a good fix
;). But it sounds like your “overkill” option is along those lines?
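
The generic trick I have in mind is a latch handshake instead of a sleep.
A throwaway sketch, nothing to do with the actual test code (names made up):

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class PositiveLockSketch {
    public static void main(String[] args) throws InterruptedException {
        // Instead of "start the task, sleep, and hope it's running by now",
        // hand the task a latch and block until it really has started.
        CountDownLatch taskStarted = new CountDownLatch(1);

        Thread worker = new Thread(() -> {
            taskStarted.countDown(); // positive signal: "I am running now"
            // ... the slow work the test wants to observe goes here ...
        });
        worker.start();

        // Fails loudly (instead of racing) if the precondition never holds.
        if (!taskStarted.await(10, TimeUnit.SECONDS)) {
            throw new AssertionError("task never started, cannot test ordering");
        }
        System.out.println("task is running, safe to assert ordering now");
        worker.join();
    }
}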

Up to you.

Meanwhile, I’ve started beasting that test on my spare machine. If you’re 
curious about what that is: 
https://gist.github.com/markrmiller/dbdb792216dc98b018ad

But don’t worry about that part. The point is that I can run the test
multiple times, with N instances in parallel, hopefully making my machine
exhibit the problem sometime overnight. If I can get it to fail before your
fix but not after, that provides some reassurance that the fix is working.
Not totally certain, of course. Otherwise, we’ll just commit your fixes and
see if Hoss’ rollups stop showing it.

Thanks again!
Erick



> On May 30, 2020, at 1:42 PM, Ilan Ginzburg  wrote:
> 
> MultiThreadedOCPTest





Why MultiThreadedOCPTest (sometimes) fails

2020-05-30 Thread Ilan Ginzburg
Following Erick’s BadApple report, I looked at MultiThreadedOCPTest.test().
I found a failure in testFillWorkQueue() in the Jenkins logs (I was not able
to reproduce it locally).

This test enqueues a large number of tasks (115, more than the 100
Collection API parallel executors) on the Collection API queue for a
collection COLL_A, then waits a short delay and enqueues a task for another
collection, COLL_B.
It verifies that the COLL_B task (which does not require the same lock as
the COLL_A tasks) completes before the third (?) COLL_A task.

*Test failures happen for a disarmingly simple reason:* when enqueues are
slowed down enough, the first 3 tasks on COLL_A complete even before the
COLL_B task gets enqueued!

In the failed Jenkins test execution, the COLL_B task enqueue happened
1275 ms after the enqueue of the first COLL_A task, leaving plenty of time
for a few (and possibly all) COLL_A tasks to complete.
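
In other words, the test bakes in a timing assumption. Spelled out with the
numbers from that run (purely illustrative, names made up):

public class FixedDelayRaceSketch {
    public static void main(String[] args) {
        long blockingTaskMs = 1_000;   // duration of the "blocking" COLL_A task
        long enqueueLatencyMs = 1_275; // COLL_B enqueue lag in the failed run

        // The test implicitly assumes the COLL_B enqueue beats the blocking
        // task. On a slow machine the assumption breaks, and so does the test.
        System.out.println("precondition holds: "
            + (enqueueLatencyMs < blockingTaskMs)); // false for the Jenkins run
    }
}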

I suggest two changes (is adding a PR to SOLR-12801 the right way to do it?
Will somebody merge it from there?):

   - Make the “blocking” COLL_A task longer to execute (increase it from 1
   to 2 seconds) to compensate for slow enqueues. Hopefully 2 seconds is
   sufficient… If it’s not, we can increase it further later.
   - Verify that the COLL_B task (a 1 ms task) finishes before the *first*
   COLL_A task (the long-running one) rather than the third. This is a
   better indication that even though the collection queue was filled with
   tasks waiting on a busy lock, a non-competing task was picked and
   executed right away.

There would still be a grey area: what if the COLL_B task were enqueued
before the first COLL_A task even started to execute? If we wanted to deal
with that, we could enqueue all the COLL_A tasks, make the *second* COLL_A
task long running (rather than the first), enqueue the COLL_B task once the
first COLL_A task has completed, and verify that the COLL_B task completes
before the second (long running) COLL_A task. I believe that's slightly
overkill (yet easier to implement than to describe, so I can include it in
the PR as well if deemed useful).
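
The sequencing is easier to see in code than in prose. A self-contained
sketch, with plain threads and latches standing in for the Collection API
tasks (all names made up, not the actual test code):

import java.util.concurrent.CountDownLatch;

public class OverkillSequencingSketch {
    public static void main(String[] args) throws InterruptedException {
        CountDownLatch firstADone = new CountDownLatch(1);
        CountDownLatch secondADone = new CountDownLatch(1);

        // COLL_A tasks run in submission order on their shared lock;
        // the *second* one is now the long-running blocker.
        Thread collA = new Thread(() -> {
            firstADone.countDown(); // first COLL_A task: quick
            try {
                Thread.sleep(2_000); // second COLL_A task: long running
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            secondADone.countDown();
        });
        collA.start();

        // Only enqueue COLL_B once the first COLL_A task has completed, so
        // we know the long task is what is running right now.
        firstADone.await();

        // COLL_B stands in for the 1 ms task on an uncontended lock; it
        // "completes" while the long COLL_A task must still be running.
        boolean longTaskStillRunning = secondADone.getCount() > 0;

        // Precondition check with an informative message, as described above.
        if (!longTaskStillRunning) {
            throw new AssertionError("slow task finished execution too soon,"
                + " cannot test running tasks in parallel");
        }
        System.out.println("COLL_B completed before the second COLL_A task");
        collA.join();
    }
}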

Ilan