[
https://issues.apache.org/jira/browse/SOLR-9671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mikhail Khludnev updated SOLR-9671:
-----------------------------------
Attachment:
TestMiniSolrCloudCluster-testCollectionCreateSearchDelete-fail-brief.txt
[^TestMiniSolrCloudCluster-testCollectionCreateSearchDelete-fail-brief.txt]
clarifies the case
bq. parallelCoreAdminExecutor-1321-thread-1 creates
testcollection_shard2_replica1
bq. parallelCoreAdminExecutor-1329-thread-1 creates
testcollection_shard2_replica2
but parallelCoreAdminExecutor-1321-thread-1 (replica1) does not appear in the
logs again before the JVM dies from an OOME (heap space).
Anyway, parallelCoreAdminExecutor-1329-thread-1 seems to try to sync
shard2_replica2 with the stalled shard2_replica1, and then gives up:
bq. o.a.s.c.ShardLeaderElectionContext We failed sync, but we have no versions
- we can't sync in that case - we were active before, so become leader anyway
but the problem is that it saturates the heap with
{quote}
749534 ERROR (qtp1915946497-6736) [ ] o.a.s.s.HttpSolrCall
null:org.apache.solr.common.SolrException: Error trying to proxy request for
url: http://127.0.0.1:42320/solr/testcollection_shard2_replica1/get
[junit4] 2> at
org.apache.solr.servlet.HttpSolrCall.remoteQuery(HttpSolrCall.java:590)
[junit4] 2> at
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:444)
{quote}
What seems strange to me is that it issues "remoteQueries", i.e. it talks to a
replica through other peers. That is the only explanation I have for why so many
of them hang on read: it seems like the two nodes call each other until the heap
is saturated. WDYT?
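To make the suspected ping-pong concrete, here is a toy model of it. This is not Solr code: the node names, the routing table, and the `simulate_proxy_loop` helper are all hypothetical, and it only illustrates the hypothesis that each node believes the other hosts the core, so every hop leaves one more request blocked on a read.

```python
# Toy model only, not Solr code. Node names and the routing table are hypothetical.
def simulate_proxy_loop(routing, start, max_hops):
    """Follow forwarding decisions and count requests left waiting on each node."""
    pending = {node: 0 for node in routing}
    node = start
    for _ in range(max_hops):
        target = routing[node]
        if target == node:
            # The node serves the request locally, so the whole chain unwinds.
            return {n: 0 for n in routing}
        # The node blocks on a read from its peer while it proxies the request.
        pending[node] += 1
        node = target
    return pending


# Assumption: each node believes the other one hosts testcollection_shard2_replica1.
routing = {"nodeA": "nodeB", "nodeB": "nodeA"}
pending = simulate_proxy_loop(routing, "nodeA", max_hops=1000)
print(pending)  # {'nodeA': 500, 'nodeB': 500}
```

In this model the number of blocked requests grows linearly with the number of hops and never drains, which would be consistent with many threads hanging on read until the heap runs out. Whether the real routing behaves this way is exactly the open question.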
> TestMiniSolrCloudCluster blowup jvm with remote /get requests
> -------------------------------------------------------------
>
> Key: SOLR-9671
> URL: https://issues.apache.org/jira/browse/SOLR-9671
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Mikhail Khludnev
> Labels: cloud
> Attachments:
> TestMiniSolrCloudCluster-testCollectionCreateSearchDelete-fail-brief.txt,
> TestMiniSolrCloudCluster-testCollectionCreateSearchDelete-fail.zip
>
>
> this is epic https://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Linux/1994/
> There are not many cores, I checked. It seems the cluster blows up when it tries
> to launch after the collection is removed. I haven't tried to reproduce it locally.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)