[jira] [Comment Edited] (SOLR-15644) Add the ability to interrupt and wait for threads for problematic tests.

Mark Robert Miller (Jira) Thu, 23 Sep 2021 02:41:07 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-15644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419087#comment-17419087
 ]


Mark Robert Miller edited comment on SOLR-15644 at 9/23/21, 9:40 AM:
---------------------------------------------------------------------

I’m not sure where I’m being offensive. I don’t expect you to go fix all the 
Solr issues, it’s a ridiculous amount of time and effort, I can tell you Uwe 
isn’t going to do it either. And it’s not a knock on the test framework either. 
If I thought it should have to deal with Solr and all of its random 
dependencies and their problems in an ideal way for every case that Solr faces I 
would say it should be changed. I haven’t found anything I’ve tried to change, 
I use it as intended - it catches bad behavior, and if I can address that 
behavior I do. And you can in 100s of cases. And in many fewer cases you either 
cannot, or the effort is too large. For instance, many things can be addressed 
by making Solr handle interrupts properly and making it close/shut down 
properly. Both are absolutely huge endeavors, I have done it, it would still 
take forever to repeat.

I won’t get into all the issues because most of them apply when the tests 
take milliseconds to seconds. When most of the tests are 10s of seconds to minutes, 
I don’t really care about the performance of moving from test to test. I’ll 
wait, linger, do rounds of interrupts, whatever. The broad linger and other 
waits are still detrimental because you lose the value of the framework letting 
you know what’s not right - things can be added now that cause almost every 
test to linger the full 10 seconds and then get interrupted and no one would 
even bat an eye or notice. But even that is not a big deal in the current world 
of things.

The only thing I have to deal with in this world is cases where you remove some 
of the exceptions and slowness of a test and it has objects / threads that you 
don’t want carrying on into other tests, but with some layers removed, the test 
framework will interrupt them, they won’t stop in time, and it will fail the 
test run. But I can stop them and not fail the run and do it relatively 
quickly. Not with any broad approach, but specifically for that problem 
resource.

The other items are no longer very interesting to me, but there are various 
cases. Sometimes there are items where if you just wait a very short time, it 
will close it very quickly, but if you hit it with an interrupt it will take 
much longer.

There are cases with the overseer where if you hit it with an interrupt you 
may poke the bear and find it almost impossible to stop. Other cases where an 
interrupt gets you out of one layer of third party code, but not fully out and 
another quick interrupt or two will get you out. All of these cases are very 
individualized to mostly isolated objects / dependencies and if I’m doing 
things well, I don’t want any kind of broad behavior to deal with them - I want 
the framework to tightly control and fail everything I don’t have to 
individually work around as a last resort.
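
To make the targeted approach concrete, here is a minimal Java sketch of 
stopping one specific problem thread: wait a short grace period first, then 
interrupt in rounds with a bounded wait after each round. The class name 
ThreadStopper, the method stop, and all of its parameters are hypothetical 
and are not part of Solr or the test framework; only the standard 
java.lang.Thread calls (join, interrupt, isAlive) are assumed.

```java
/** Hypothetical helper for illustration; not part of Solr or the test framework. */
final class ThreadStopper {

  private ThreadStopper() {}

  /**
   * Tries to stop a single known thread.
   *
   * @param thread             the specific problem thread to stop
   * @param graceMillis        how long to wait before sending any interrupt
   * @param maxInterrupts      maximum number of interrupt rounds
   * @param waitPerRoundMillis how long to wait after each interrupt
   * @return true if the thread has terminated when this method returns
   */
  static boolean stop(Thread thread, long graceMillis, int maxInterrupts,
                      long waitPerRoundMillis) throws InterruptedException {
    // Grace period first: some resources close faster when left alone.
    // join(0) would wait forever, so only join for a positive timeout.
    if (graceMillis > 0) {
      thread.join(graceMillis);
    }
    // Interrupt in rounds: one interrupt may only escape one layer of
    // third-party code, so a follow-up interrupt may be needed.
    for (int round = 0; round < maxInterrupts && thread.isAlive(); round++) {
      thread.interrupt();
      if (waitPerRoundMillis > 0) {
        thread.join(waitPerRoundMillis);
      }
    }
    return !thread.isAlive();
  }
}
```

A test would call something like ThreadStopper.stop(problemThread, 50, 3, 500) 
only for the resource it knows is troublesome, and leave everything else to 
the framework's normal leak checks. The right values differ per resource.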

Anyway, 1000s of issues are there, it just depends on what you care about and 
what affects you. You could look at the big integration tests and say they are 
heavy, and so crank down the number of test JVMs and say, Solr tests are often 
heavy, use fewer JVMs than Lucene. And that might be the end of it and Solr will 
still search your data. You could also look at those test JVMs when you start 
up 15 at once and see most of the heavy tests sitting there using 0-3% of cpu 
most of the time. You could then look into that and find 1000 things that if 
addressed make those tests run as fast or faster than tight no dependency 
Lucene tests. If you took the former approach, maybe there are not 1000 
problems, the tests are passing and you are searching your data. With the second 
approach, you have a slightly different system, where those huge integration 
tests are giving Lucene a hard time and actually using cpu and moving and 
exposing actual issues that are never seen when they sit around mostly hanging 
out. 1000 issues from one angle, no major issue from the other. Which perspective 
I’m seeing depends on if I’m just collecting a paycheck or want to be honest 
about what’s going on in front of me. Whether it’s a tangential piece of 
software in my daily life or a core piece. 



> Add the ability to interrupt and wait for threads for problematic tests.
> ------------------------------------------------------------------------
>
>                 Key: SOLR-15644
>                 URL: https://issues.apache.org/jira/browse/SOLR-15644
>             Project: Solr
>          Issue Type: Test
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Tests
>            Reporter: Mark Robert Miller
>            Assignee: Mark Robert Miller
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> The stuff in the test framework is slow and lacks control. For problematic 
> tests, you don't want to linger first and you want fine control around 
> interrupting - interrupting with a sledgehammer approach can actually make 
> things take longer.


