[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-18 Thread Scott Blum (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Blum updated SOLR-10420:
--
Fix Version/s: (was: 6.4.3)

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.2, 6.4.2, 6.5
>Reporter: Markus Jelsma
>Assignee: Scott Blum
> Fix For: 5.5.5, 5.6, 6.5.1, 6.6, master (7.0)
>
> Attachments: OverseerTest.106.stdout, OverseerTest.119.stdout, 
> OverseerTest.80.stdout, OverseerTest.DEBUG.43.stdout, 
> OverseerTest.DEBUG.48.stdout, OverseerTest.DEBUG.58.stdout, 
> SOLR-10420-dragonsinth.patch, SOLR-10420.patch, SOLR-10420.patch, 
> SOLR-10420.patch, SOLR-10420.patch, SOLR-10420.patch
>
>
> One of our nodes went berserk after a restart; Solr went completely nuts! 
> So I opened VisualVM to keep an eye on it and spotted a different problem 
> that occurs on all our Solr 6.4.2 and 6.5.0 nodes.
> It appears Solr is leaking one SolrZkClient instance per second via 
> DistributedQueue$ChildWatcher. The one-per-second rate is quite accurate for 
> all nodes: there are about as many instances as there are seconds since 
> Solr started. I know VisualVM's instance count includes 
> objects-to-be-collected, but the instance count does not drop after a forced 
> garbage collection round.
> It doesn't matter how many cores or collections the nodes carry or how heavy 
> traffic is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-18 Thread Scott Blum (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Blum updated SOLR-10420:
--
Fix Version/s: 5.6

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.2, 6.4.2, 6.5
>Reporter: Markus Jelsma
>Assignee: Scott Blum
> Fix For: 5.5.5, 5.6, 6.4.3, 6.5.1, 6.6, master (7.0)
>
> Attachments: OverseerTest.106.stdout, OverseerTest.119.stdout, 
> OverseerTest.80.stdout, OverseerTest.DEBUG.43.stdout, 
> OverseerTest.DEBUG.48.stdout, OverseerTest.DEBUG.58.stdout, 
> SOLR-10420-dragonsinth.patch, SOLR-10420.patch, SOLR-10420.patch, 
> SOLR-10420.patch, SOLR-10420.patch, SOLR-10420.patch
>
>



[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-18 Thread Scott Blum (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Blum updated SOLR-10420:
--
Fix Version/s: 6.6

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.2, 6.4.2, 6.5
>Reporter: Markus Jelsma
>Assignee: Scott Blum
> Fix For: 5.5.5, 6.4.3, 6.5.1, 6.6, master (7.0)
>
> Attachments: OverseerTest.106.stdout, OverseerTest.119.stdout, 
> OverseerTest.80.stdout, OverseerTest.DEBUG.43.stdout, 
> OverseerTest.DEBUG.48.stdout, OverseerTest.DEBUG.58.stdout, 
> SOLR-10420-dragonsinth.patch, SOLR-10420.patch, SOLR-10420.patch, 
> SOLR-10420.patch, SOLR-10420.patch, SOLR-10420.patch
>
>



[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-18 Thread Scott Blum (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Blum updated SOLR-10420:
--
Fix Version/s: master (7.0)
   6.5.1
   6.4.3
   5.5.5

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.2, 6.4.2, 6.5
>Reporter: Markus Jelsma
>Assignee: Scott Blum
> Fix For: 5.5.5, 6.4.3, 6.5.1, master (7.0)
>
> Attachments: OverseerTest.106.stdout, OverseerTest.119.stdout, 
> OverseerTest.80.stdout, OverseerTest.DEBUG.43.stdout, 
> OverseerTest.DEBUG.48.stdout, OverseerTest.DEBUG.58.stdout, 
> SOLR-10420-dragonsinth.patch, SOLR-10420.patch, SOLR-10420.patch, 
> SOLR-10420.patch, SOLR-10420.patch, SOLR-10420.patch
>
>



[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-17 Thread Scott Blum (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Blum updated SOLR-10420:
--
Attachment: (was: SOLR-10420-dragonsinth.patch)

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.2, 6.4.2, 6.5
>Reporter: Markus Jelsma
>Assignee: Scott Blum
> Attachments: OverseerTest.106.stdout, OverseerTest.119.stdout, 
> OverseerTest.80.stdout, OverseerTest.DEBUG.43.stdout, 
> OverseerTest.DEBUG.48.stdout, OverseerTest.DEBUG.58.stdout, 
> SOLR-10420-dragonsinth.patch, SOLR-10420.patch, SOLR-10420.patch, 
> SOLR-10420.patch, SOLR-10420.patch, SOLR-10420.patch
>
>



[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-17 Thread Scott Blum (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Blum updated SOLR-10420:
--
Attachment: SOLR-10420-dragonsinth.patch

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.2, 6.4.2, 6.5
>Reporter: Markus Jelsma
>Assignee: Scott Blum
> Attachments: OverseerTest.106.stdout, OverseerTest.119.stdout, 
> OverseerTest.80.stdout, OverseerTest.DEBUG.43.stdout, 
> OverseerTest.DEBUG.48.stdout, OverseerTest.DEBUG.58.stdout, 
> SOLR-10420-dragonsinth.patch, SOLR-10420.patch, SOLR-10420.patch, 
> SOLR-10420.patch, SOLR-10420.patch, SOLR-10420.patch
>
>



[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-17 Thread Scott Blum (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Blum updated SOLR-10420:
--
Attachment: SOLR-10420-dragonsinth.patch

[~caomanhdat] [~jhump] I think this may be the right approach after reviewing 
the overall design.  I don't see any real reason to specifically track 
lastWatcher, we just need to ensure that no more than one is ever set.  And 
having lastWatcher serve double-duty was a misdesign on my part.  There are 
really two separate stateful questions to answer:

1) Is there a watcher set?
2) Are we known to be dirty?

The answer to those two questions is not the same if we want to support 
same-thread synchronous offer -> poll working as you would want.  So this patch 
tracks them separately.
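
To make the two questions above concrete, here is a minimal sketch of keeping them as separate pieces of state. It is only an illustration, not the code in the attached patch: the class `QueueStateSketch` and all of its method names are hypothetical, and it makes no ZooKeeper calls.

```java
/** Hypothetical illustration only; not the actual DistributedQueue code. */
class QueueStateSketch {
    // Question 1: is there a watcher set? Counted so "never more than one" can be checked.
    private int watcherCount = 0;
    // Question 2: are we known to be dirty (the in-memory view may be stale)?
    private boolean dirty = true;

    /** Returns true if the caller should register a new watcher before reading. */
    synchronized boolean claimWatcherSlot() {
        if (watcherCount > 0) {
            return false; // a watcher is already outstanding, do not add another
        }
        watcherCount++;
        return true;
    }

    /** The outstanding watcher fired: it is gone, and the view is stale. */
    synchronized void onWatcherFired() {
        watcherCount--;
        dirty = true;
    }

    /** A local offer makes the view stale but does not consume the watcher. */
    synchronized void onLocalOffer() {
        dirty = true;
    }

    /** The in-memory view was re-read from the backing store. */
    synchronized void onRefreshed() {
        dirty = false;
    }

    synchronized boolean isDirty() { return dirty; }

    synchronized int watcherCount() { return watcherCount; }

    public static void main(String[] args) {
        QueueStateSketch s = new QueueStateSketch();
        s.claimWatcherSlot(); // first poll registers a watcher
        s.onRefreshed();
        s.onLocalOffer();     // same-thread offer
        // The next poll must re-read (dirty) without adding a second watcher.
        System.out.println("dirty=" + s.isDirty()
                + " watchers=" + s.watcherCount()
                + " needNewWatcher=" + s.claimWatcherSlot());
    }
}
```

In this sketch a same-thread offer followed by a poll sees `dirty == true` while the watcher count stays at one, which is the case where the two answers differ.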

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.2, 6.4.2, 6.5
>Reporter: Markus Jelsma
> Attachments: OverseerTest.106.stdout, OverseerTest.119.stdout, 
> OverseerTest.80.stdout, OverseerTest.DEBUG.43.stdout, 
> OverseerTest.DEBUG.48.stdout, OverseerTest.DEBUG.58.stdout, 
> SOLR-10420-dragonsinth.patch, SOLR-10420.patch, SOLR-10420.patch, 
> SOLR-10420.patch, SOLR-10420.patch, SOLR-10420.patch
>
>



[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-17 Thread Cao Manh Dat (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cao Manh Dat updated SOLR-10420:

Attachment: SOLR-10420.patch

Latest patch. I would prefer not to reuse lastWatcher, because it can lead to 
this case:
{code}
peek -> lastWatcher = reuseWatcher (1)
offer -> lastWatcher = null
peek -> lastWatcher = reuseWatcher (2)
(1) event -> lastWatcher = null
{code}
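
The sequence above can be replayed with a toy model. This is only a sketch under assumptions: the class `LastWatcherTrace`, its methods, and the stand-in watcher object are hypothetical, and no ZooKeeper calls are involved. It shows one reading of the case, namely that the event belonging to peek (1) clears the field that peek (2) had just set.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy replay of the four steps; hypothetical names, no ZooKeeper involved. */
class LastWatcherTrace {
    // Stand-in for the single watcher instance that would be reused.
    static final Object REUSE_WATCHER = new Object();

    Object lastWatcher = null;
    final List<String> trace = new ArrayList<>();

    void peek() {
        lastWatcher = REUSE_WATCHER;
        trace.add("peek  -> lastWatcher = reuseWatcher");
    }

    void offer() {
        lastWatcher = null;
        trace.add("offer -> lastWatcher = null");
    }

    void event() {
        lastWatcher = null;
        trace.add("event -> lastWatcher = null");
    }

    /** Runs the four steps in the order given in the comment. */
    static LastWatcherTrace replay() {
        LastWatcherTrace q = new LastWatcherTrace();
        q.peek();  // (1)
        q.offer();
        q.peek();  // (2)
        q.event(); // the event that belongs to (1) arrives late
        return q;
    }

    public static void main(String[] args) {
        LastWatcherTrace q = replay();
        q.trace.forEach(System.out::println);
        System.out.println("lastWatcher still tracked: " + (q.lastWatcher != null));
    }
}
```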

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.2, 6.4.2, 6.5
>Reporter: Markus Jelsma
> Attachments: OverseerTest.106.stdout, OverseerTest.119.stdout, 
> OverseerTest.80.stdout, OverseerTest.DEBUG.43.stdout, 
> OverseerTest.DEBUG.48.stdout, OverseerTest.DEBUG.58.stdout, SOLR-10420.patch, 
> SOLR-10420.patch, SOLR-10420.patch, SOLR-10420.patch, SOLR-10420.patch
>
>



[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-15 Thread Cao Manh Dat (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cao Manh Dat updated SOLR-10420:

Attachment: SOLR-10420.patch

I had a discussion with Noble. It seems that DistributedQueue (DQ) is not used 
anywhere except the Overseer, so I will go with solution #1.

I will beast the test on Steve's machine tonight (thanks a lot, [~steve_rowe]).

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.2, 6.4.2, 6.5
>Reporter: Markus Jelsma
> Attachments: OverseerTest.106.stdout, OverseerTest.119.stdout, 
> OverseerTest.80.stdout, OverseerTest.DEBUG.43.stdout, 
> OverseerTest.DEBUG.48.stdout, OverseerTest.DEBUG.58.stdout, SOLR-10420.patch, 
> SOLR-10420.patch, SOLR-10420.patch, SOLR-10420.patch
>
>



[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-13 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated SOLR-10420:
--
Attachment: OverseerTest.DEBUG.58.stdout
OverseerTest.DEBUG.48.stdout
OverseerTest.DEBUG.43.stdout

I got 5 failures out of 100, attaching 3 of them here: 
[^OverseerTest.DEBUG.43.stdout], [^OverseerTest.DEBUG.48.stdout], 
[^OverseerTest.DEBUG.58.stdout]

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.2, 6.5, 6.4.2
>Reporter: Markus Jelsma
> Attachments: OverseerTest.106.stdout, OverseerTest.119.stdout, 
> OverseerTest.80.stdout, OverseerTest.DEBUG.43.stdout, 
> OverseerTest.DEBUG.48.stdout, OverseerTest.DEBUG.58.stdout, SOLR-10420.patch, 
> SOLR-10420.patch, SOLR-10420.patch
>
>



[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-13 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated SOLR-10420:
--
Attachment: OverseerTest.119.stdout
OverseerTest.106.stdout

I beasted the latest patch (the one without the increased OverseerTest 
timeouts) for 200 iterations and got 2 failures - I've attached their logs: 
[^OverseerTest.106.stdout] and [^OverseerTest.119.stdout].

Next I'll try the patch with the OverseerTest changes.

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.2, 6.5, 6.4.2
>Reporter: Markus Jelsma
> Attachments: OverseerTest.106.stdout, OverseerTest.119.stdout, 
> OverseerTest.80.stdout, SOLR-10420.patch, SOLR-10420.patch, SOLR-10420.patch
>
>



[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-12 Thread Cao Manh Dat (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cao Manh Dat updated SOLR-10420:

Attachment: SOLR-10420.patch

A patch for this ticket. It reuses the ChildWatcher so that, in every case 
(including race conditions), we always reach the line
{{ lastWatcher = null }}

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.2, 6.5, 6.4.2
>Reporter: Markus Jelsma
> Attachments: OverseerTest.80.stdout, SOLR-10420.patch, 
> SOLR-10420.patch, SOLR-10420.patch
>
>



[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-12 Thread Cao Manh Dat (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cao Manh Dat updated SOLR-10420:

Attachment: SOLR-10420.patch

[~steve_rowe] I think this is a problem with the test. Can you run the test with 
this patch? (I increased the time spent waiting for the replica to become active.)

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.2, 6.5, 6.4.2
>Reporter: Markus Jelsma
> Attachments: OverseerTest.80.stdout, SOLR-10420.patch, 
> SOLR-10420.patch
>
>



[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-12 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated SOLR-10420:
--
Affects Version/s: 5.5.2

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.2, 6.5, 6.4.2
>Reporter: Markus Jelsma
> Attachments: OverseerTest.80.stdout, SOLR-10420.patch
>
>



[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-12 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated SOLR-10420:
--
Fix Version/s: (was: branch_6x)
   (was: master (7.0))

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 6.5, 6.4.2
>Reporter: Markus Jelsma
> Attachments: OverseerTest.80.stdout, SOLR-10420.patch
>
>



[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-12 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated SOLR-10420:
--
Attachment: OverseerTest.80.stdout

I ran all Solr tests with the patch on master, and one test failed: 

{noformat}
   [junit4]   2> 264992 ERROR (OverseerExitThread) [] o.a.s.c.Overseer 
could not read the data
   [junit4]   2> org.apache.zookeeper.KeeperException$SessionExpiredException: 
KeeperErrorCode = Session expired for /overseer_elect/leader
   [junit4]   2>at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
   [junit4]   2>at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
   [junit4]   2>at 
org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
   [junit4]   2>at 
org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:356)
   [junit4]   2>at 
org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:353)
   [junit4]   2>at 
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
   [junit4]   2>at 
org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:353)
   [junit4]   2>at 
org.apache.solr.cloud.Overseer$ClusterStateUpdater.checkIfIamStillLeader(Overseer.java:290)
   [junit4]   2>at java.lang.Thread.run(Thread.java:745)
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=OverseerTest 
-Dtests.method=testExternalClusterStateChangeBehavior 
-Dtests.seed=2110CE0AEF674CFA -Dtests.slow=true -Dtests.locale=es-GT 
-Dtests.timezone=Asia/Kolkata -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] FAILURE 5.46s J12 | 
OverseerTest.testExternalClusterStateChangeBehavior <<<
   [junit4]> Throwable #1: java.lang.AssertionError: Illegal state, was: 
down expected:active clusterState:live 
nodes:[]collections:{c1=DocCollection(c1//clusterstate.json/2)={
   [junit4]>   "shards":{"shard1":{
   [junit4]>   "parent":null,
   [junit4]>   "range":null,
   [junit4]>   "state":"active",
   [junit4]>   "replicas":{"core_node1":{
   [junit4]>   "base_url":"http://127.0.0.1/solr",
   [junit4]>   "node_name":"node1",
   [junit4]>   "core":"core1",
   [junit4]>   "roles":"",
   [junit4]>   "state":"down",
   [junit4]>   "router":{"name":"implicit"}}, test=LazyCollectionRef(test)}
   [junit4]>at 
__randomizedtesting.SeedInfo.seed([2110CE0AEF674CFA:490ECDE60DF716B4]:0)
   [junit4]>at 
org.apache.solr.cloud.AbstractDistribZkTestBase.verifyReplicaStatus(AbstractDistribZkTestBase.java:273)
   [junit4]>at 
org.apache.solr.cloud.OverseerTest.testExternalClusterStateChangeBehavior(OverseerTest.java:1259)
{noformat}

I ran the repro line a couple of times and it didn't reproduce.  I then beasted 
100 iterations of the test suite using Miller's beasting script, and it failed 
once.  I'm attaching the test log from the failure.

Looking at emailed Jenkins reports of 
{{testExternalClusterStateChangeBehavior()}} failing, I see that it was failing 
almost daily until the day SOLR-9191 was committed (June 9, 2016), and then 
zero failures since, so this failure seems suspicious to me, since this issue 
is related to SOLR-9191.

I beasted 200 iterations of OverseerTest without the patch, and got zero 
failures.

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 6.5, 6.4.2
>Reporter: Markus Jelsma
> Fix For: master (7.0), branch_6x
>
> Attachments: OverseerTest.80.stdout, SOLR-10420.patch
>
>



[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-12 Thread Cao Manh Dat (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cao Manh Dat updated SOLR-10420:

Attachment: SOLR-10420.patch

Patch for this ticket. This problem was introduced by SOLR-9191 and is a 
serious problem for Solr 6.x.

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 6.5, 6.4.2
>Reporter: Markus Jelsma
> Fix For: master (7.0), branch_6x
>
> Attachments: SOLR-10420.patch
>
>