[jira] [Commented] (SOLR-16414) Race condition in PRS state updates

Michael Gibney (Jira) Fri, 04 Nov 2022 08:36:04 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-16414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629050#comment-17629050
 ]


Michael Gibney commented on SOLR-16414:
---------------------------------------

Thank you for the in-depth analysis, Patson! And thanks for the thread dump and 
for identifying this issue, Ishan and Noble. Would you be able to provide logs 
for the shutdown as well?

{quote} We actually do not need any parallelism.  The operations are quite 
fast{quote}

iiuc everyone's in agreement on that point; but the way this manifested doesn't 
look like it's simply related to concurrent load induced by using 
{{parallelStream}} instead of serial {{forEach}}. On the one hand this 
hopefully reassures [~janhoy] that this fix isn't simply a matter of throttling 
load in an arbitrary way -- it's actually a consequence of the behavior of 
{{parallelStream}} in a way unrelated to parallelism _per se_. On the other 
hand, this may have uncovered a latent issue, perhaps around exception 
handling/ordering assumptions in the shutdown code, warranting digging a bit 
further to figure out more specifically what's going on, and if there may be 
other changes that could guard against this kind of thing happening in the 
future. 

Patson's analysis definitely seems relevant, but the thread dump Ishan posted 
seems to point at something else possibly going on. What I find curious about 
the thread dump is that it doesn't actually look like resource contention at 
this point; rather, it looks like a bunch of non-daemon threads somehow got 
created _after_ the shutdown process considered itself to be finished, and the 
non-daemon threads are preventing the JVM from exiting, despite the fact that 
the shutdown hook has exited and no more work is actually being done.
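
To illustrate the general JVM behavior described above (this is not Solr code), here is a minimal sketch assuming a hypothetical pool whose threads are named {{sketchExecutor-worker}}. It shows that a non-daemon worker stays alive, idle, after its only task has finished; until someone calls {{shutdown()}} on the pool, such a thread keeps the JVM from exiting even though no work is being done.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LingeringPoolSketch {

    // Hypothetical thread-name prefix, chosen only for this sketch.
    static final String PREFIX = "sketchExecutor-";

    /** Counts live non-daemon threads whose names start with PREFIX. */
    static long liveNonDaemonWorkers() {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.isAlive() && !t.isDaemon() && t.getName().startsWith(PREFIX))
                .count();
    }

    /** Runs one task on a fresh pool and reports how many idle workers remain. */
    static long idleWorkersAfterTask() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2, r -> {
            Thread t = new Thread(r, PREFIX + "worker");
            t.setDaemon(false); // explicit: non-daemon threads block JVM exit
            return t;
        });
        try {
            pool.submit(() -> "done").get(); // the only work this pool ever does
            return liveNonDaemonWorkers();   // the worker is idle now, but still alive
        } finally {
            pool.shutdown();                 // without this, the JVM would not exit
            pool.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("idle non-daemon workers after the task: " + idleWorkersAfterTask());
    }
}
```

In a thread dump, workers like this show up parked and waiting on the pool's work queue, which is consistent with "no resource contention, but the JVM does not exit".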

It's possible I'm misreading the situation, but fwiw that hypothetical 
situation could be a consequence of the behavior Patson outlined: could the 
tasks executed by {{parallelStream}} somehow re-instantiate the 
"searcherExecutor" and "parallelCoreAdminExecutor" thread pools _after_ the 
point when the shutdown process would consider the need to shut them down?
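
A rough sketch of that hypothesis, and of one possible guard, follows. Everything in it is an assumption made for illustration: {{LazyHolder}} is a hypothetical stand-in for any component that creates its thread pool on first use, and none of this is taken from the Solr code base. A straggler task that reaches the holder after {{close()}} has run silently creates a fresh pool that nobody will shut down; with a closed flag, the late request is refused instead.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LateExecutorSketch {

    /** Hypothetical holder that creates its pool on first use (not Solr code). */
    static class LazyHolder {
        private final boolean guarded;
        private boolean closed;
        private ExecutorService pool;

        LazyHolder(boolean guarded) {
            this.guarded = guarded;
        }

        synchronized ExecutorService get() {
            if (guarded && closed) {
                throw new IllegalStateException("holder already closed");
            }
            if (pool == null || pool.isShutdown()) {
                // The default thread factory creates non-daemon threads.
                pool = Executors.newSingleThreadExecutor();
            }
            return pool;
        }

        synchronized void close() {
            closed = true;
            if (pool != null) {
                pool.shutdown();
            }
        }

        synchronized boolean hasLivePool() {
            return pool != null && !pool.isShutdown();
        }
    }

    /** Simulates a task that reaches the holder only after close() has run. */
    static boolean lateTaskRevivesPool(boolean guarded) {
        LazyHolder holder = new LazyHolder(guarded);
        holder.get().execute(() -> { });     // normal use before shutdown
        holder.close();                      // shutdown considers itself finished
        try {
            holder.get().execute(() -> { }); // straggler arriving after shutdown
        } catch (IllegalStateException refused) {
            // Guarded holder: the late request is rejected, no new pool is created.
        }
        boolean revived = holder.hasLivePool();
        holder.close();                      // clean up so this sketch itself can exit
        return revived;
    }

    public static void main(String[] args) {
        System.out.println("unguarded: pool revived = " + lateTaskRevivesPool(false));
        System.out.println("guarded:   pool revived = " + lateTaskRevivesPool(true));
    }
}
```

Whether anything like this is what actually happened here would need to be confirmed against the shutdown logs.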

> Race condition in PRS state updates
> -----------------------------------
>
>                 Key: SOLR-16414
>                 URL: https://issues.apache.org/jira/browse/SOLR-16414
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Noble Paul
>            Assignee: Noble Paul
>            Priority: Major
>             Fix For: 9.1
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> For PRS collections the individual states are potentially updated from 
> individual nodes and sometimes from overseer too. It's possible that:
>  
>  # OP1 is sent to overseer at T1
>  # OP2 is executed in the node itself at T2
>  
> Because we cannot guarantee that OP1, sent to overseer, executes before 
> OP2, the final state may be the result of OP1, which is incorrect and can 
> lead to errors.
> The solution is to never do any PRS writes from overseer. 



