[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second
[ https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Blum updated SOLR-10420:
------------------------------
    Fix Version/s: (was: 6.4.3)

> Solr 6.x leaking one SolrZkClient instance per second
> ------------------------------------------------------
>
>                 Key: SOLR-10420
>                 URL: https://issues.apache.org/jira/browse/SOLR-10420
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>    Affects Versions: 5.5.2, 6.4.2, 6.5
>            Reporter: Markus Jelsma
>            Assignee: Scott Blum
>             Fix For: 5.5.5, 5.6, 6.5.1, 6.6, master (7.0)
>
>         Attachments: OverseerTest.106.stdout, OverseerTest.119.stdout,
> OverseerTest.80.stdout, OverseerTest.DEBUG.43.stdout,
> OverseerTest.DEBUG.48.stdout, OverseerTest.DEBUG.58.stdout,
> SOLR-10420-dragonsinth.patch, SOLR-10420.patch, SOLR-10420.patch,
> SOLR-10420.patch, SOLR-10420.patch, SOLR-10420.patch
>
>
> One of our nodes went berserk after a restart; Solr went completely nuts!
> So I opened VisualVM to keep an eye on it and spotted a different problem
> that occurs in all our Solr 6.4.2 and 6.5.0 nodes.
> It appears Solr is leaking one SolrZkClient instance per second via
> DistributedQueue$ChildWatcher. That one-per-second rate is quite accurate for
> all nodes: there are about as many instances as there are seconds since Solr
> started. I know VisualVM's instance count includes objects-to-be-collected,
> but the instance count does not drop after a forced garbage collection round.
> It doesn't matter how many cores or collections the nodes carry or how heavy
> the traffic is.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second
[ https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Blum updated SOLR-10420:
------------------------------
    Fix Version/s: 5.6
[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second
[ https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Blum updated SOLR-10420:
------------------------------
    Fix Version/s: 6.6
[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second
[ https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Blum updated SOLR-10420:
------------------------------
    Fix Version/s: master (7.0)
                   6.5.1
                   6.4.3
                   5.5.5
[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second
[ https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Blum updated SOLR-10420:
------------------------------
    Attachment: (was: SOLR-10420-dragonsinth.patch)
[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second
[ https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Blum updated SOLR-10420:
------------------------------
    Attachment: SOLR-10420-dragonsinth.patch
[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second
[ https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Blum updated SOLR-10420:
------------------------------
    Attachment: SOLR-10420-dragonsinth.patch

[~caomanhdat] [~jhump] I think this may be the right approach after reviewing the overall design.

I don't see any real reason to specifically track lastWatcher; we just need to ensure that no more than one is ever set. Having lastWatcher serve double duty was a misdesign on my part. There are really two separate stateful questions to answer:

1) Is there a watcher set?
2) Are we known to be dirty?

The answers to those two questions are not the same if we want a same-thread synchronous offer -> poll to work as you would want. So this patch tracks them separately.
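To make the two questions in the comment above concrete, here is a minimal Java sketch that keeps them in separate fields. It is only an illustration under stated assumptions, not the attached patch: the class `TwoStateQueueSketch`, its fields, and the `registered` list standing in for server-side watches are all hypothetical names, and no real ZooKeeper call is made.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a queue client; not the Solr DistributedQueue class. */
class TwoStateQueueSketch {
    /** Stand-in for the child watches the server currently holds for us. */
    final List<Runnable> registered = new ArrayList<>();

    private boolean watcherSet = false; // question 1: is there a watcher set?
    private boolean dirty = true;       // question 2: are we known to be dirty?
    private int fetches = 0;            // how often we went to the imaginary server

    /** A same-thread write makes the cached view stale, whether or not a watch exists. */
    void offer() {
        dirty = true;
    }

    /** Refetch only when dirty; register a watcher only when none is outstanding. */
    void peek() {
        if (!dirty) {
            return; // cached view is still trusted
        }
        fetches++;
        if (!watcherSet) {
            registered.add(() -> {
                watcherSet = false; // one-shot watch: it is gone once fired
                dirty = true;       // something changed on the server
            });
            watcherSet = true;
        }
        dirty = false;
    }

    /** Simulates the server firing, and thereby clearing, every outstanding watch. */
    void fireAll() {
        List<Runnable> fired = new ArrayList<>(registered);
        registered.clear();
        fired.forEach(Runnable::run);
    }

    int fetchCount() {
        return fetches;
    }
}
```

In this model, repeated `peek()` calls never add a second entry to `registered`, while `offer()` followed by `peek()` still forces a refetch even though a watcher is already set. That is the case where the two answers differ.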
[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second
[ https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cao Manh Dat updated SOLR-10420:
--------------------------------
    Attachment: SOLR-10420.patch

Latest patch attached. I would prefer not to reuse lastWatcher, because it can lead to this case:

{code}
peek -> lastWatcher = reuseWatcher (1)
offer -> lastWatcher = null
peek -> lastWatcher = reuseWatcher (2)
(1) event -> lastWatcher = null
{code}
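The interleaving in the comment above can be replayed deterministically. The sketch below is a hypothetical single-threaded model, not Solr code: `ReusedWatcherRaceSketch` and the `outstanding` counter are invented for illustration, and whether a real ZooKeeper client would count the two registrations of the same watcher object separately is not modeled here.

```java
/** Hypothetical replay of the interleaving; not the Solr DistributedQueue class. */
class ReusedWatcherRaceSketch {
    private final Object reuseWatcher = new Object(); // the single shared watcher
    Object lastWatcher = null; // what the queue believes is currently set
    int outstanding = 0;       // model-only count of registrations not yet fired

    void peek() {
        if (lastWatcher == null) {
            outstanding++;              // registers the shared watcher again
            lastWatcher = reuseWatcher;
        }
    }

    void offer() {
        lastWatcher = null;
    }

    /** An event for an earlier registration arrives and clears the field. */
    void event() {
        outstanding--;
        lastWatcher = null;
    }

    /** Runs the four steps in the order given in the comment. */
    static ReusedWatcherRaceSketch replay() {
        ReusedWatcherRaceSketch q = new ReusedWatcherRaceSketch();
        q.peek();  // registration (1)
        q.offer();
        q.peek();  // registration (2)
        q.event(); // event for (1)
        return q;
    }
}
```

After `replay()`, `lastWatcher` is null although the model still counts one registration as outstanding, so the field no longer answers whether a watcher is set.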
[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second
[ https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cao Manh Dat updated SOLR-10420:
--------------------------------
    Attachment: SOLR-10420.patch

I had a discussion with Noble. It seems that DQ (DistributedQueue) is not used anywhere except Overseer, so I will go with solution #1. I will beast the test on Steve's machine tonight (thanks a lot, [~steve_rowe]).
[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second
[ https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Rowe updated SOLR-10420:
------------------------------
    Attachment: OverseerTest.DEBUG.58.stdout
                OverseerTest.DEBUG.48.stdout
                OverseerTest.DEBUG.43.stdout

I got 5 failures out of 100, attaching 3 of them here: [^OverseerTest.DEBUG.43.stdout], [^OverseerTest.DEBUG.48.stdout], [^OverseerTest.DEBUG.58.stdout]
[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second
[ https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Rowe updated SOLR-10420:
------------------------------
    Attachment: OverseerTest.119.stdout
                OverseerTest.106.stdout

I beasted the latest patch (the one without the increased OverseerTest timeouts) for 200 iterations and got 2 failures. I've attached their logs: [^OverseerTest.106.stdout] and [^OverseerTest.119.stdout].

Next I'll try the patch with the OverseerTest changes.
[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second
[ https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cao Manh Dat updated SOLR-10420:
--------------------------------
    Attachment: SOLR-10420.patch

A patch for this ticket. In this patch we reuse the ChildWatcher, so that in every case (including race conditions) we always reach the line {{lastWatcher = null}}.
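One way to read the comment above, sketched in Java. This assumes the reset is guarded by an identity check on the watcher that fired; the thread does not show that code, so the guard, the class `WatcherIdentitySketch`, and its methods are hypothetical and only illustrate why reusing a single instance would always reach the reset.

```java
/** Hypothetical illustration; the identity guard is an assumption, not shown in the thread. */
class WatcherIdentitySketch {
    Object lastWatcher;

    /** Assumed guarded reset: only the watcher we last set may clear the field. */
    void onEvent(Object firedWatcher) {
        if (lastWatcher == firedWatcher) {
            lastWatcher = null;
        }
    }

    /** A fresh watcher per call: an event for the older one skips the reset. */
    static boolean resetReachedWithFreshWatchers() {
        WatcherIdentitySketch q = new WatcherIdentitySketch();
        Object first = new Object();
        q.lastWatcher = first;
        q.lastWatcher = new Object(); // a second call replaced the field
        q.onEvent(first);             // event for the first watcher
        return q.lastWatcher == null;
    }

    /** One reused watcher: every event passes the identity check. */
    static boolean resetReachedWithReusedWatcher() {
        WatcherIdentitySketch q = new WatcherIdentitySketch();
        Object shared = new Object();
        q.lastWatcher = shared;
        q.lastWatcher = shared;       // a second call sets the same instance
        q.onEvent(shared);
        return q.lastWatcher == null;
    }
}
```

Under that assumption, `resetReachedWithFreshWatchers()` returns false and `resetReachedWithReusedWatcher()` returns true.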
[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second
[ https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cao Manh Dat updated SOLR-10420:
--------------------------------
    Attachment: SOLR-10420.patch

[~steve_rowe] I think this is a problem with the test. Can you run the test with this patch? (I increased the time it waits for the replica to become active.)
[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second
[ https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Rowe updated SOLR-10420:
------------------------------
    Affects Version/s: 5.5.2
[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second
[ https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Rowe updated SOLR-10420:
------------------------------
    Fix Version/s: (was: branch_6x)
                   (was: master (7.0))
[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second
[ https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Rowe updated SOLR-10420:
------------------------------
    Attachment: OverseerTest.80.stdout

I ran all Solr tests with the patch on master, and one test failed:

{noformat}
[junit4] 2> 264992 ERROR (OverseerExitThread) [] o.a.s.c.Overseer could not read the data
[junit4] 2> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/leader
[junit4] 2>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
[junit4] 2>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
[junit4] 2>     at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
[junit4] 2>     at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:356)
[junit4] 2>     at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:353)
[junit4] 2>     at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
[junit4] 2>     at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:353)
[junit4] 2>     at org.apache.solr.cloud.Overseer$ClusterStateUpdater.checkIfIamStillLeader(Overseer.java:290)
[junit4] 2>     at java.lang.Thread.run(Thread.java:745)
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=OverseerTest -Dtests.method=testExternalClusterStateChangeBehavior -Dtests.seed=2110CE0AEF674CFA -Dtests.slow=true -Dtests.locale=es-GT -Dtests.timezone=Asia/Kolkata -Dtests.asserts=true -Dtests.file.encoding=UTF-8
[junit4] FAILURE 5.46s J12 | OverseerTest.testExternalClusterStateChangeBehavior <<<
[junit4]> Throwable #1: java.lang.AssertionError: Illegal state, was: down expected:active clusterState:live nodes:[]collections:{c1=DocCollection(c1//clusterstate.json/2)={
[junit4]>   "shards":{"shard1":{
[junit4]>       "parent":null,
[junit4]>       "range":null,
[junit4]>       "state":"active",
[junit4]>       "replicas":{"core_node1":{
[junit4]>           "base_url":"http://127.0.0.1/solr",
[junit4]>           "node_name":"node1",
[junit4]>           "core":"core1",
[junit4]>           "roles":"",
[junit4]>           "state":"down",
[junit4]>   "router":{"name":"implicit"}}, test=LazyCollectionRef(test)}
[junit4]>     at __randomizedtesting.SeedInfo.seed([2110CE0AEF674CFA:490ECDE60DF716B4]:0)
[junit4]>     at org.apache.solr.cloud.AbstractDistribZkTestBase.verifyReplicaStatus(AbstractDistribZkTestBase.java:273)
[junit4]>     at org.apache.solr.cloud.OverseerTest.testExternalClusterStateChangeBehavior(OverseerTest.java:1259)
{noformat}

I ran the repro line a couple of times and it didn't reproduce. I then beasted 100 iterations of the test suite using Miller's beasting script, and it failed once. I'm attaching the test log from the failure.

Looking at emailed Jenkins reports of {{testExternalClusterStateChangeBehavior()}} failing, I see that it was failing almost daily until the day SOLR-9191 was committed (June 9, 2016), and then zero failures since, so this failure seems suspicious to me, since this issue is related to SOLR-9191.

I beasted 200 iterations of OverseerTest without the patch, and got zero failures.
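As a rough aid for reading the beasting numbers in the message above (1 failure in 100 runs with the patch, 0 in 200 without), the Java sketch below computes how likely a clean run of n iterations is for a given per-run failure rate. It assumes independent runs and a fixed failure probability; neither is claimed in the thread, and `BeastingOddsSketch` is a hypothetical helper.

```java
/** Hypothetical helper; assumes independent runs with a fixed failure probability. */
class BeastingOddsSketch {
    /** Probability that n runs all pass when each run fails with probability p. */
    static double probabilityOfZeroFailures(double p, int n) {
        return Math.pow(1.0 - p, n);
    }
}
```

For example, `BeastingOddsSketch.probabilityOfZeroFailures(0.01, 200)` comes out at roughly 0.13. If the unpatched code also failed about once per 100 runs, a clean 200-iteration beast would still be expected about one time in eight under these assumptions.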
[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second
[ https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cao Manh Dat updated SOLR-10420:
--------------------------------
    Attachment: SOLR-10420.patch

Patch for this ticket. This problem was introduced by SOLR-9191. It is a serious problem for Solr 6.x.