[ https://issues.apache.org/jira/browse/SOLR-9671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mikhail Khludnev updated SOLR-9671:
-----------------------------------
    Attachment: TestMiniSolrCloudCluster-testCollectionCreateSearchDelete-fail-brief.txt

[^TestMiniSolrCloudCluster-testCollectionCreateSearchDelete-fail-brief.txt] 
clarifies the case

bq. parallelCoreAdminExecutor-1321-thread-1 creates testcollection_shard2_replica1
bq. parallelCoreAdminExecutor-1329-thread-1 creates testcollection_shard2_replica2

but parallelCoreAdminExecutor-1321-thread-1 (replica1) never appears in the 
logs again before the JVM dies with an OOME (heap space). 
Meanwhile, parallelCoreAdminExecutor-1329-thread-1 seems to try to sync 
shard2_replica2 with the stalled shard2_replica1, and then gives up:
bq. o.a.s.c.ShardLeaderElectionContext We failed sync, but we have no versions 
- we can't sync in that case - we were active before, so become leader anyway

But the problem is that it then saturates the heap with 
{quote}
749534 ERROR (qtp1915946497-6736) [    ] o.a.s.s.HttpSolrCall 
null:org.apache.solr.common.SolrException: Error trying to proxy request for 
url: http://127.0.0.1:42320/solr/testcollection_shard2_replica1/get
   [junit4]   2>        at 
org.apache.solr.servlet.HttpSolrCall.remoteQuery(HttpSolrCall.java:590)
   [junit4]   2>        at 
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:444)
{quote}

To me it's strange that it issues "remoteQueries", i.e. talks to a replica 
through other peers, but that's the only explanation I have for why so many of 
them are hanging on read: it seems like two nodes call each other until the 
heap is saturated. WDYT?
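
The suspected ping-pong can be illustrated with a toy model. This is only a sketch of the hypothesis, not Solr's actual HttpSolrCall logic: the names Node, handle, make_cluster and max_hops are all hypothetical, and the assumption is simply that a node which cannot serve a core locally forwards the request to its peer.

```python
# Toy model only; this is NOT Solr code. Node, handle, make_cluster and
# max_hops are hypothetical names used for illustration.

class Node:
    def __init__(self, name, local_cores=()):
        self.name = name
        self.local_cores = set(local_cores)  # cores this node can serve itself
        self.peer = None                     # the other node in a two-node cluster
        self.in_flight = 0                   # requests currently blocked on the peer
        self.peak_in_flight = 0              # high-water mark of blocked requests

    def handle(self, core, hops=0, max_hops=None):
        """Serve `core` locally, otherwise proxy to the peer (the 'remote query' path)."""
        if core in self.local_cores:
            return f"{self.name} served {core} after {hops} hop(s)"
        if max_hops is not None and hops >= max_hops:
            raise RuntimeError(f"gave up on {core} after {hops} hop(s)")
        # While waiting for the peer, this request pins a thread and its buffers.
        self.in_flight += 1
        self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
        try:
            return self.peer.handle(core, hops + 1, max_hops)
        finally:
            self.in_flight -= 1


def make_cluster(a_cores=(), b_cores=()):
    """Build two nodes that proxy to each other."""
    a, b = Node("node_a", a_cores), Node("node_b", b_cores)
    a.peer, b.peer = b, a
    return a, b


if __name__ == "__main__":
    core = "testcollection_shard2_replica1"

    # Healthy case: node_b hosts the core, so one hop is enough.
    a, b = make_cluster(b_cores=[core])
    print(a.handle(core))

    # Stalled case: neither node can serve the core, so the request bounces
    # between them; only the hop cap stops the chain from growing.
    a, b = make_cluster()
    try:
        a.handle(core, max_hops=10)
    except RuntimeError as e:
        print(e, "| peak blocked requests:", a.peak_in_flight + b.peak_in_flight)
```

In this model every forwarded hop leaves one more request blocked on a read, so without a hop cap (max_hops=None) the number of pending requests grows until some resource runs out, which is the pattern the heap dump suggests.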

> TestMiniSolrCloudCluster blowup jvm with remote /get requests
> -------------------------------------------------------------
>
>                 Key: SOLR-9671
>                 URL: https://issues.apache.org/jira/browse/SOLR-9671
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Mikhail Khludnev
>              Labels: cloud
>         Attachments: 
> TestMiniSolrCloudCluster-testCollectionCreateSearchDelete-fail-brief.txt, 
> TestMiniSolrCloudCluster-testCollectionCreateSearchDelete-fail.zip
>
>
> This is epic: https://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Linux/1994/
> There are not many cores, I checked. It seems like the cluster blows up when 
> it tries to launch after the collection is removed. I haven't tried to 
> reproduce it locally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
