[jira] [Commented] (SOLR-13320) add a param ignoreDuplicates=true to updates to not overwrite existing docs

2019-04-25 Thread Scott Blum (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826483#comment-16826483
 ] 

Scott Blum commented on SOLR-13320:
---

+1!

> add a param ignoreDuplicates=true to updates to not overwrite existing docs
> ---
>
> Key: SOLR-13320
> URL: https://issues.apache.org/jira/browse/SOLR-13320
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
>
> Updates should have an option to ignore duplicate documents and drop them if 
> an option  {{ignoreDuplicates=true}} is specified






[jira] [Commented] (SOLR-13320) add a param ignoreDuplicates=true to updates to not overwrite existing docs

2019-03-15 Thread Scott Blum (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16794015#comment-16794015
 ] 

Scott Blum commented on SOLR-13320:
---

Shalin lemme break this down a bit...

Imagine you're restoring a collection from a backup, but you want to be able to 
accept writes while this is in progress.  You start accepting writes (of new 
data) on the new, empty collection, then in the background you want to backfill 
from your backup copy, but you don't want to overwrite anything that has been 
written recently.

Setting "version:-1" on all the incoming, backfill doc is almost what you 
want-- add any documents that don't exist, but don't overwrite any documents 
that do exist.  The problem is that the entire batch gets rejected if even one 
document already exists.  We just want a way to be able to ignore conflicts and 
quietly drop the offending documents rather than rejecting the entire batch.

"ignoreConflicts" might be a better name.

> add a param ignoreDuplicates=true to updates to not overwrite existing docs
> ---
>
> Key: SOLR-13320
> URL: https://issues.apache.org/jira/browse/SOLR-13320
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
>
> Updates should have an option to ignore duplicate documents and drop them if 
> an option  {{ignoreDuplicates=true}} is specified






[jira] [Comment Edited] (SOLR-13320) add a param ignoreDuplicates=true to updates to not overwrite existing docs

2019-03-15 Thread Scott Blum (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16794015#comment-16794015
 ] 

Scott Blum edited comment on SOLR-13320 at 3/15/19 11:19 PM:
-

[~shalinmangar] lemme break this down a bit...

Imagine you're restoring a collection from a backup, but you want to be able to 
accept writes while this is in progress.  You start accepting writes (of new 
data) on the new, empty collection, then in the background you want to backfill 
from your backup copy, but you don't want to overwrite anything that has been 
written recently.

Setting "version:-1" on all the incoming, backfill doc is almost what you want: 
add any documents that don't exist, but don't overwrite any documents that do 
exist.  The problem is that the entire batch gets rejected if even one document 
already exists.  We just want a way to be able to ignore conflicts and quietly 
drop the offending documents rather than rejecting the entire batch.

"ignoreConflicts" might be a better name.


was (Author: dragonsinth):
[~shalinmangar] lemme break this down a bit...

Imagine you're restoring a collection from a backup, but you want to be able to 
accept writes while this is in progress.  You start accepting writes (of new 
data) on the new, empty collection, then in the background you want to backfill 
from your backup copy, but you don't want to overwrite anything that has been 
written recently.

Setting "version:-1" on all the incoming, backfill doc is almost what you 
want-- add any documents that don't exist, but don't overwrite any documents 
that do exist.  The problem is that the entire batch gets rejected if even one 
document already exists.  We just want a way to be able to ignore conflicts and 
quietly drop the offending documents rather than rejecting the entire batch.

"ignoreConflicts" might be a better name.

> add a param ignoreDuplicates=true to updates to not overwrite existing docs
> ---
>
> Key: SOLR-13320
> URL: https://issues.apache.org/jira/browse/SOLR-13320
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
>
> Updates should have an option to ignore duplicate documents and drop them if 
> an option  {{ignoreDuplicates=true}} is specified






[jira] [Comment Edited] (SOLR-13320) add a param ignoreDuplicates=true to updates to not overwrite existing docs

2019-03-15 Thread Scott Blum (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16794015#comment-16794015
 ] 

Scott Blum edited comment on SOLR-13320 at 3/15/19 11:19 PM:
-

[~shalinmangar] lemme break this down a bit...

Imagine you're restoring a collection from a backup, but you want to be able to 
accept writes while this is in progress.  You start accepting writes (of new 
data) on the new, empty collection, then in the background you want to backfill 
from your backup copy, but you don't want to overwrite anything that has been 
written recently.

Setting "version:-1" on all the incoming, backfill doc is almost what you 
want-- add any documents that don't exist, but don't overwrite any documents 
that do exist.  The problem is that the entire batch gets rejected if even one 
document already exists.  We just want a way to be able to ignore conflicts and 
quietly drop the offending documents rather than rejecting the entire batch.

"ignoreConflicts" might be a better name.


was (Author: dragonsinth):
Shalin lemme break this down a bit...

Imagine you're restoring a collection from a backup, but you want to be able to 
accept writes while this is in progress.  You start accepting writes (of new 
data) on the new, empty collection, then in the background you want to backfill 
from your backup copy, but you don't want to overwrite anything that has been 
written recently.

Setting "version:-1" on all the incoming, backfill doc is almost what you 
want-- add any documents that don't exist, but don't overwrite any documents 
that do exist.  The problem is that the entire batch gets rejected if even one 
document already exists.  We just want a way to be able to ignore conflicts and 
quietly drop the offending documents rather than rejecting the entire batch.

"ignoreConflicts" might be a better name.

> add a param ignoreDuplicates=true to updates to not overwrite existing docs
> ---
>
> Key: SOLR-13320
> URL: https://issues.apache.org/jira/browse/SOLR-13320
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
>
> Updates should have an option to ignore duplicate documents and drop them if 
> an option  {{ignoreDuplicates=true}} is specified






[jira] [Resolved] (SOLR-5180) Core admin reload for an invalid core name returns 500 rather than 400 status code

2018-08-15 Thread Scott Blum (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-5180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Blum resolved SOLR-5180.
--
Resolution: Fixed

This is now fixed
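
For reference, a hedged sketch of the general shape of such a fix for the 
500-vs-400 mismatch described below (not necessarily the actual commit; 
{{cname}} stands in for the requested core name): re-throw the original 
{{SolrException}} from the reload path so the 400 BAD_REQUEST from "No such 
core" is not re-wrapped into a generic 500.

{code}
try {
  coreContainer.reload(cname);
} catch (SolrException e) {
  throw e; // preserves ErrorCode.BAD_REQUEST for a nonexistent core
} catch (Exception e) {
  throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
      "Error handling 'reload' action", e);
}
{code}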

> Core admin reload for an invalid core name returns 500 rather than 400 status 
> code
> --
>
> Key: SOLR-5180
> URL: https://issues.apache.org/jira/browse/SOLR-5180
> Project: Solr
>  Issue Type: Bug
>  Components: multicore
>Affects Versions: 4.4
>Reporter: Jack Krupansky
>Assignee: Scott Blum
>Priority: Major
>
> A core admin request to reload a nonexistent core name returns a 500 Server 
> Error rather than a 400 Invalid Request status code.
> The request: 
> {code}
> curl 
> "http://localhost:8983/solr/admin/cores?action=reload=bogus=true;
> {code}
> The response:
> {code}
> 
> 
> 
>   500
>   5
> 
> 
>   Error handling 'reload' action
>   org.apache.solr.common.SolrException: Error handling 
> 'reload' action
> at 
> org.apache.solr.handler.admin.CoreAdminHandler.handleReloadAction(CoreAdminHandler.java:673)
> at 
> org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:172)
> at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:655)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:246)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
> at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
> at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
> at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
> at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
> at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
> at org.eclipse.jetty.server.Server.handle(Server.java:368)
> at 
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
> at 
> org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
> at 
> org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
> at 
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
> at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
> at 
> org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
> at 
> org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
> at 
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
> at java.lang.Thread.run(Unknown Source)
> Caused by: org.apache.solr.common.SolrException: Unable to reload core: bogus
> at 
> org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:930)
> at org.apache.solr.core.CoreContainer.reload(CoreContainer.java:685)
> at 
> org.apache.solr.handler.admin.CoreAdminHandler.handleReloadAction(CoreAdminHandler.java:671)
> ... 30 more
> Caused by: org.apache.solr.common.SolrException: No such core: bogus
> at org.apache.solr.core.CoreContainer.reload(CoreContainer.java:636)
> ... 31 more
> 
>   500
> 
> 
> {code}
> The code at CoreContainer.reload(CoreContainer.java:636) correctly throws a 
> Solr Bad Request Exception:
> {Code}
>   if (core == null)
> throw new SolrException( SolrException.ErrorCode.BAD_REQUEST, "No 
> such core: " + name 

[jira] [Assigned] (SOLR-5180) Core admin reload for an invalid core name returns 500 rather than 400 status code

2018-08-15 Thread Scott Blum (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-5180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Blum reassigned SOLR-5180:


Assignee: Scott Blum

> Core admin reload for an invalid core name returns 500 rather than 400 status 
> code
> --
>
> Key: SOLR-5180
> URL: https://issues.apache.org/jira/browse/SOLR-5180
> Project: Solr
>  Issue Type: Bug
>  Components: multicore
>Affects Versions: 4.4
>Reporter: Jack Krupansky
>Assignee: Scott Blum
>Priority: Major
>
> A core admin request to reload a nonexistent core name returns a 500 Server 
> Error rather than a 400 Invalid Request status code.
> The request: 
> {code}
> curl 
> "http://localhost:8983/solr/admin/cores?action=reload=bogus=true;
> {code}
> The response:
> {code}
> 
> 
> 
>   500
>   5
> 
> 
>   Error handling 'reload' action
>   org.apache.solr.common.SolrException: Error handling 
> 'reload' action
> at 
> org.apache.solr.handler.admin.CoreAdminHandler.handleReloadAction(CoreAdminHandler.java:673)
> at 
> org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:172)
> at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:655)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:246)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
> at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
> at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
> at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
> at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
> at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
> at org.eclipse.jetty.server.Server.handle(Server.java:368)
> at 
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
> at 
> org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
> at 
> org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
> at 
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
> at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
> at 
> org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
> at 
> org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
> at 
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
> at java.lang.Thread.run(Unknown Source)
> Caused by: org.apache.solr.common.SolrException: Unable to reload core: bogus
> at 
> org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:930)
> at org.apache.solr.core.CoreContainer.reload(CoreContainer.java:685)
> at 
> org.apache.solr.handler.admin.CoreAdminHandler.handleReloadAction(CoreAdminHandler.java:671)
> ... 30 more
> Caused by: org.apache.solr.common.SolrException: No such core: bogus
> at org.apache.solr.core.CoreContainer.reload(CoreContainer.java:636)
> ... 31 more
> 
>   500
> 
> 
> {code}
> The code at CoreContainer.reload(CoreContainer.java:636) correctly throws a 
> Solr Bad Request Exception:
> {Code}
>   if (core == null)
> throw new SolrException( SolrException.ErrorCode.BAD_REQUEST, "No 
> such core: " + name );
> 

[jira] [Commented] (SOLR-8327) SolrDispatchFilter is not caching new state format, which results in live fetch from ZK per request if node does not contain core from collection

2017-12-19 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297371#comment-16297371
 ] 

Scott Blum commented on SOLR-8327:
--

https://github.com/apache/lucene-solr/pull/294

> SolrDispatchFilter is not caching new state format, which results in live 
> fetch from ZK per request if node does not contain core from collection
> -
>
> Key: SOLR-8327
> URL: https://issues.apache.org/jira/browse/SOLR-8327
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.3
>Reporter: Jessica Cheng Mallet
>Assignee: Varun Thacker
>  Labels: solrcloud
> Attachments: SOLR-8327.patch
>
>
> While perf testing with non-solrj client (request can be sent to any solr 
> node), we noticed a huge amount of data from Zookeeper in our tcpdump (~1G 
> for 20 second dump). From the thread dump, we noticed this:
> java.lang.Object.wait (Native Method)
> java.lang.Object.wait (Object.java:503)
> org.apache.zookeeper.ClientCnxn.submitRequest (ClientCnxn.java:1309)
> org.apache.zookeeper.ZooKeeper.getData (ZooKeeper.java:1152)
> org.apache.solr.common.cloud.SolrZkClient$7.execute (SolrZkClient.java:345)
> org.apache.solr.common.cloud.SolrZkClient$7.execute (SolrZkClient.java:342)
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation 
> (ZkCmdExecutor.java:61)
> org.apache.solr.common.cloud.SolrZkClient.getData (SolrZkClient.java:342)
> org.apache.solr.common.cloud.ZkStateReader.getCollectionLive 
> (ZkStateReader.java:841)
> org.apache.solr.common.cloud.ZkStateReader$7.get (ZkStateReader.java:515)
> org.apache.solr.common.cloud.ClusterState.getCollectionOrNull 
> (ClusterState.java:175)
> org.apache.solr.common.cloud.ClusterState.getLeader (ClusterState.java:98)
> org.apache.solr.servlet.HttpSolrCall.getCoreByCollection 
> (HttpSolrCall.java:784)
> org.apache.solr.servlet.HttpSolrCall.init (HttpSolrCall.java:272)
> org.apache.solr.servlet.HttpSolrCall.call (HttpSolrCall.java:417)
> org.apache.solr.servlet.SolrDispatchFilter.doFilter 
> (SolrDispatchFilter.java:210)
> org.apache.solr.servlet.SolrDispatchFilter.doFilter 
> (SolrDispatchFilter.java:179)
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter 
> (ServletHandler.java:1652)
> org.eclipse.jetty.servlet.ServletHandler.doHandle (ServletHandler.java:585)
> org.eclipse.jetty.server.handler.ScopedHandler.handle (ScopedHandler.java:143)
> org.eclipse.jetty.security.SecurityHandler.handle (SecurityHandler.java:577)
> org.eclipse.jetty.server.session.SessionHandler.doHandle 
> (SessionHandler.java:223)
> org.eclipse.jetty.server.handler.ContextHandler.doHandle 
> (ContextHandler.java:1127)
> org.eclipse.jetty.servlet.ServletHandler.doScope (ServletHandler.java:515)
> org.eclipse.jetty.server.session.SessionHandler.doScope 
> (SessionHandler.java:185)
> org.eclipse.jetty.server.handler.ContextHandler.doScope 
> (ContextHandler.java:1061)
> org.eclipse.jetty.server.handler.ScopedHandler.handle (ScopedHandler.java:141)
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle 
> (ContextHandlerCollection.java:215)
> org.eclipse.jetty.server.handler.HandlerCollection.handle 
> (HandlerCollection.java:110)
> org.eclipse.jetty.server.handler.HandlerWrapper.handle 
> (HandlerWrapper.java:97)
> org.eclipse.jetty.server.Server.handle (Server.java:499)
> org.eclipse.jetty.server.HttpChannel.handle (HttpChannel.java:310)
> org.eclipse.jetty.server.HttpConnection.onFillable (HttpConnection.java:257)
> org.eclipse.jetty.io.AbstractConnection$2.run (AbstractConnection.java:540)
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob 
> (QueuedThreadPool.java:635)
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run 
> (QueuedThreadPool.java:555)
> java.lang.Thread.run (Thread.java:745)
> Looks like SolrDispatchFilter doesn't have caching similar to the 
> collectionStateCache in CloudSolrClient, so if the node doesn't know about a 
> collection in the new state format, it just live-fetches it from ZooKeeper on 
> every request.
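
A hedged sketch of the kind of per-node collection-state cache the description 
above asks for (illustrative only, not the contents of the linked PR; all names 
are made up):

{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Illustrative TTL cache: avoid a live ZooKeeper read on every request for
// collections this node hosts no core for.
class CollectionStateCache<T> {
  private static final long TTL_NANOS = TimeUnit.SECONDS.toNanos(60);

  private static final class Entry<S> {
    final S state;
    final long fetchedAtNanos;
    Entry(S state, long fetchedAtNanos) {
      this.state = state;
      this.fetchedAtNanos = fetchedAtNanos;
    }
  }

  private final ConcurrentHashMap<String, Entry<T>> cache = new ConcurrentHashMap<>();

  T get(String collection, Function<String, T> fetchFromZk) {
    Entry<T> cached = cache.get(collection);
    long now = System.nanoTime();
    if (cached == null || now - cached.fetchedAtNanos > TTL_NANOS) {
      T fresh = fetchFromZk.apply(collection); // the expensive live ZK read
      cache.put(collection, new Entry<>(fresh, now));
      return fresh;
    }
    return cached.state;
  }
}
{code}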






[jira] [Resolved] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect

2017-12-09 Thread Scott Blum (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Blum resolved SOLR-11423.
---
    Resolution: Fixed
Fix Version/s: master (8.0), 7.2

> Overseer queue needs a hard cap (maximum size) that clients respect
> ---
>
> Key: SOLR-11423
> URL: https://issues.apache.org/jira/browse/SOLR-11423
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Scott Blum
>Assignee: Scott Blum
> Fix For: 7.2, master (8.0)
>
>
> When Solr gets into pathological GC thrashing states, it can fill the 
> overseer queue with literally thousands and thousands of queued state 
> changes.  Many of these end up being duplicated up/down state updates.  Our 
> production cluster has gotten to the 100k queued items level many times, and 
> there's nothing useful you can do at this point except manually purge the 
> queue in ZK.  Recently, it hit 3 million queued items, at which point our 
> entire ZK cluster exploded.
> I propose a hard cap.  Any client trying to enqueue a item when a queue is 
> full would throw an exception.  I was thinking maybe 10,000 items would be a 
> reasonable limit.  Thoughts?






[jira] [Comment Edited] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect

2017-12-08 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284021#comment-16284021
 ] 

Scott Blum edited comment on SOLR-11423 at 12/8/17 6:52 PM:


I'm fixing solr/CHANGES.txt while cherry-picking into 7x and 7_2.  I'll push a 
change to master to move it when that's done.

Note: the solr/CHANGES.txt that shipped with 7.1 does not erroneously report 
this bugfix as being included; it's just master/7x/7_2 that has it wrong.



was (Author: dragonsinth):
I'm fixing solr/CHANGES.txt while cherry-picking into 7x and 7_2.  I'll push a 
change to master to move it when that's done.

> Overseer queue needs a hard cap (maximum size) that clients respect
> ---
>
> Key: SOLR-11423
> URL: https://issues.apache.org/jira/browse/SOLR-11423
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Scott Blum
>Assignee: Scott Blum
>
> When Solr gets into pathological GC thrashing states, it can fill the 
> overseer queue with literally thousands and thousands of queued state 
> changes.  Many of these end up being duplicated up/down state updates.  Our 
> production cluster has gotten to the 100k queued items level many times, and 
> there's nothing useful you can do at this point except manually purge the 
> queue in ZK.  Recently, it hit 3 million queued items, at which point our 
> entire ZK cluster exploded.
> I propose a hard cap.  Any client trying to enqueue a item when a queue is 
> full would throw an exception.  I was thinking maybe 10,000 items would be a 
> reasonable limit.  Thoughts?






[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect

2017-12-08 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284021#comment-16284021
 ] 

Scott Blum commented on SOLR-11423:
---

I'm fixing solr/CHANGES.txt while cherry-picking into 7x and 7_2.  I'll push a 
change to master to move it when that's done.

> Overseer queue needs a hard cap (maximum size) that clients respect
> ---
>
> Key: SOLR-11423
> URL: https://issues.apache.org/jira/browse/SOLR-11423
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Scott Blum
>Assignee: Scott Blum
>
> When Solr gets into pathological GC thrashing states, it can fill the 
> overseer queue with literally thousands and thousands of queued state 
> changes.  Many of these end up being duplicated up/down state updates.  Our 
> production cluster has gotten to the 100k queued items level many times, and 
> there's nothing useful you can do at this point except manually purge the 
> queue in ZK.  Recently, it hit 3 million queued items, at which point our 
> entire ZK cluster exploded.
> I propose a hard cap.  Any client trying to enqueue a item when a queue is 
> full would throw an exception.  I was thinking maybe 10,000 items would be a 
> reasonable limit.  Thoughts?






[jira] [Comment Edited] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect

2017-12-08 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283972#comment-16283972
 ] 

Scott Blum edited comment on SOLR-11423 at 12/8/17 6:21 PM:


Sounds good to me.  So backport to branch_7_2 and branch_7x?


was (Author: dragonsinth):
Sounds good to me.

> Overseer queue needs a hard cap (maximum size) that clients respect
> ---
>
> Key: SOLR-11423
> URL: https://issues.apache.org/jira/browse/SOLR-11423
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Scott Blum
>Assignee: Scott Blum
>
> When Solr gets into pathological GC thrashing states, it can fill the 
> overseer queue with literally thousands and thousands of queued state 
> changes.  Many of these end up being duplicated up/down state updates.  Our 
> production cluster has gotten to the 100k queued items level many times, and 
> there's nothing useful you can do at this point except manually purge the 
> queue in ZK.  Recently, it hit 3 million queued items, at which point our 
> entire ZK cluster exploded.
> I propose a hard cap.  Any client trying to enqueue a item when a queue is 
> full would throw an exception.  I was thinking maybe 10,000 items would be a 
> reasonable limit.  Thoughts?






[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect

2017-12-08 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283972#comment-16283972
 ] 

Scott Blum commented on SOLR-11423:
---

Sounds good to me.

> Overseer queue needs a hard cap (maximum size) that clients respect
> ---
>
> Key: SOLR-11423
> URL: https://issues.apache.org/jira/browse/SOLR-11423
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Scott Blum
>Assignee: Scott Blum
>
> When Solr gets into pathological GC thrashing states, it can fill the 
> overseer queue with literally thousands and thousands of queued state 
> changes.  Many of these end up being duplicated up/down state updates.  Our 
> production cluster has gotten to the 100k queued items level many times, and 
> there's nothing useful you can do at this point except manually purge the 
> queue in ZK.  Recently, it hit 3 million queued items, at which point our 
> entire ZK cluster exploded.
> I propose a hard cap.  Any client trying to enqueue a item when a queue is 
> full would throw an exception.  I was thinking maybe 10,000 items would be a 
> reasonable limit.  Thoughts?






[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect

2017-12-07 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282711#comment-16282711
 ] 

Scott Blum commented on SOLR-11423:
---

I didn't resolve due to Noble's #comment-16203208 but I have no objection to 
resolving.  I dropped it on master because I wasn't sure what branches we'd 
want to backport to.

> Overseer queue needs a hard cap (maximum size) that clients respect
> ---
>
> Key: SOLR-11423
> URL: https://issues.apache.org/jira/browse/SOLR-11423
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Scott Blum
>Assignee: Scott Blum
>
> When Solr gets into pathological GC thrashing states, it can fill the 
> overseer queue with literally thousands and thousands of queued state 
> changes.  Many of these end up being duplicated up/down state updates.  Our 
> production cluster has gotten to the 100k queued items level many times, and 
> there's nothing useful you can do at this point except manually purge the 
> queue in ZK.  Recently, it hit 3 million queued items, at which point our 
> entire ZK cluster exploded.
> I propose a hard cap.  Any client trying to enqueue a item when a queue is 
> full would throw an exception.  I was thinking maybe 10,000 items would be a 
> reasonable limit.  Thoughts?






[jira] [Commented] (SOLR-11590) Synchronize ZK connect/disconnect handling

2017-11-03 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238276#comment-16238276
 ] 

Scott Blum commented on SOLR-11590:
---

LGTM

> Synchronize ZK connect/disconnect handling
> --
>
> Key: SOLR-11590
> URL: https://issues.apache.org/jira/browse/SOLR-11590
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Noble Paul
>Priority: Major
> Attachments: SOLR-11590.patch
>
>
> Here is a sequence of 2 disconnects and re-connects
> {code}
> 1. 2017-10-31T08:34:23.106-0700 Watcher 
> org.apache.solr.common.cloud.ConnectionManager@1579ca20 
> name:ZooKeeperConnection Watcher:host:port got event WatchedEvent 
> state:Disconnected type:None path:null path:null type:None
> 2. 2017-10-31T08:34:23.106-0700 zkClient has disconnected
> 3. 2017-10-31T08:34:23.107-0700 Watcher 
> org.apache.solr.common.cloud.ConnectionManager@1579ca20 
> name:ZooKeeperConnection Watcher:host:port got event WatchedEvent 
> state:SyncConnected type:None path:null path:null type:None
> {code}
> {code}
> 1. 2017-10-31T08:36:46.541-0700 Watcher 
> org.apache.solr.common.cloud.ConnectionManager@1579ca20 
> name:ZooKeeperConnection Watcher:host:port got event WatchedEvent 
> state:Disconnected type:None path:null path:null type:None
> 2. 2017-10-31T08:36:46.549-0700 Watcher 
> org.apache.solr.common.cloud.ConnectionManager@1579ca20 
> name:ZooKeeperConnection Watcher:host:port got event WatchedEvent 
> state:SyncConnected type:None path:null path:null type:None
> 2. 2017-10-31T08:36:46.563-0700 zkClient has disconnected
> {code}
> In the first disconnect the sequence is -  get disconnect watcher, execute 
> disconnect code, execute connect code
> In the second disconnect the sequence is - get disconnect watcher, execute 
> connect code, execute disconnect code
> In the second sequence of events, if the JVM has leader replicas then all 
> updates start failing with "Cannot talk to ZooKeeper - Updates are disabled." 
> This starts happening exactly after 27 seconds (the ZK client timeout is 30s, 
> and 90% of 30 = 27, when the code thinks the session is likely expired). No 
> leadership changes happen since there was no session expiry. Unless you 
> restart the node, all updates to the system continue to fail.
> These log lines are from Solr 5.3, which is why the WatchedEvent was still 
> being logged at INFO.
> Based on the log ordering, we process the connect code and then the disconnect 
> code out of order. The connection is active but the flag is not set, and hence 
> after 27 seconds {{zkCheck}} starts complaining that the connection is likely 
> expired.
> A related Jira is SOLR-5721
> ZK gives us ordered watch events ( 
> https://zookeeper.apache.org/doc/r3.4.8/zookeeperProgrammers.html#sc_WatchGuarantees
>  ) but from what I understand Solr can still process them out of order. We 
> could take a lock and synchronize {{ConnectionManager#connected}} and 
> {{ConnectionManager#disconnected}} . 
> Would that be the right approach to take?
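
A hedged sketch of the synchronization idea suggested above (illustrative only; 
the real ConnectionManager carries more state than this):

{code}
// Illustrative only: serialize connect/disconnect handling so out-of-order
// watcher callbacks cannot leave the connected flag in the wrong state.
private final Object connectionUpdateLock = new Object();
private volatile boolean connected = false;

private void connected() {
  synchronized (connectionUpdateLock) {
    connected = true;
    // ... reset the "likely expired" bookkeeping used by zkCheck ...
  }
}

private void disconnected() {
  synchronized (connectionUpdateLock) {
    connected = false;
    // ... record the disconnect time so zkCheck can detect likely expiry ...
  }
}
{code}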






[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect

2017-11-02 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16236685#comment-16236685
 ] 

Scott Blum commented on SOLR-11423:
---

Oh wow, nice catch.  How'd I mess that one up?

> Overseer queue needs a hard cap (maximum size) that clients respect
> ---
>
> Key: SOLR-11423
> URL: https://issues.apache.org/jira/browse/SOLR-11423
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Scott Blum
>Assignee: Scott Blum
>Priority: Major
>
> When Solr gets into pathological GC thrashing states, it can fill the 
> overseer queue with literally thousands and thousands of queued state 
> changes.  Many of these end up being duplicated up/down state updates.  Our 
> production cluster has gotten to the 100k queued items level many times, and 
> there's nothing useful you can do at this point except manually purge the 
> queue in ZK.  Recently, it hit 3 million queued items, at which point our 
> entire ZK cluster exploded.
> I propose a hard cap.  Any client trying to enqueue a item when a queue is 
> full would throw an exception.  I was thinking maybe 10,000 items would be a 
> reasonable limit.  Thoughts?






[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect

2017-10-14 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16204874#comment-16204874
 ] 

Scott Blum commented on SOLR-11423:
---

Is it possible to pick a number related to ZK's inherent limits?  That's really 
the goal here: prevent Solr from destroying ZK.

> Overseer queue needs a hard cap (maximum size) that clients respect
> ---
>
> Key: SOLR-11423
> URL: https://issues.apache.org/jira/browse/SOLR-11423
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Scott Blum
>Assignee: Scott Blum
>
> When Solr gets into pathological GC thrashing states, it can fill the 
> overseer queue with literally thousands and thousands of queued state 
> changes.  Many of these end up being duplicated up/down state updates.  Our 
> production cluster has gotten to the 100k queued items level many times, and 
> there's nothing useful you can do at this point except manually purge the 
> queue in ZK.  Recently, it hit 3 million queued items, at which point our 
> entire ZK cluster exploded.
> I propose a hard cap.  Any client trying to enqueue a item when a queue is 
> full would throw an exception.  I was thinking maybe 10,000 items would be a 
> reasonable limit.  Thoughts?






[jira] [Commented] (SOLR-11443) Remove the usage of workqueue for Overseer

2017-10-13 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16204443#comment-16204443
 ] 

Scott Blum commented on SOLR-11443:
---

LGTM

> Remove the usage of workqueue for Overseer
> --
>
> Key: SOLR-11443
> URL: https://issues.apache.org/jira/browse/SOLR-11443
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
> Attachments: SOLR-11443.patch, SOLR-11443.patch, SOLR-11443.patch
>
>
> If we can remove the usage of workqueue, We can save a lot of IO blocking in 
> Overseer, hence boost performance a lot.






[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect

2017-10-13 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16204004#comment-16204004
 ] 

Scott Blum commented on SOLR-11423:
---

I'm happy to defer on this issue, but I just want to be clear that I actively 
dislike having a system property.  It feels like a not-very-useful piece of config, 
and worse, what happens if you don't set the same cap on every node?  Remember 
this is enforced client side, not server side.  If you accidentally have a mix 
of nodes where half of them cap at 20k, and half of them cap at 40k, then the 
moment you get above 20k any badly behaving 40k nodes are going to starve out 
the 20k nodes.  It becomes unfair contention.

> Overseer queue needs a hard cap (maximum size) that clients respect
> ---
>
> Key: SOLR-11423
> URL: https://issues.apache.org/jira/browse/SOLR-11423
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Scott Blum
>Assignee: Scott Blum
>
> When Solr gets into pathological GC thrashing states, it can fill the 
> overseer queue with literally thousands and thousands of queued state 
> changes.  Many of these end up being duplicated up/down state updates.  Our 
> production cluster has gotten to the 100k queued items level many times, and 
> there's nothing useful you can do at this point except manually purge the 
> queue in ZK.  Recently, it hit 3 million queued items, at which point our 
> entire ZK cluster exploded.
> I propose a hard cap.  Any client trying to enqueue a item when a queue is 
> full would throw an exception.  I was thinking maybe 10,000 items would be a 
> reasonable limit.  Thoughts?






[jira] [Commented] (SOLR-11443) Remove the usage of workqueue for Overseer

2017-10-13 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16204001#comment-16204001
 ] 

Scott Blum commented on SOLR-11443:
---

SOLR-11447 looks interesting, might well address that comment.

{code}
int cacheSizeBefore = knownChildren.size();
knownChildren.removeAll(paths);
if (cacheSizeBefore - paths.size() == knownChildren.size()) {
  stats.setQueueLength(knownChildren.size());
} else {
  // There are elements get deleted but not present in the cache,
  // the cache seems not valid anymore
  knownChildren.clear();
  isDirty = true;
}
{code}

I just kind of feel like you should unconditionally clear and set dirty, to 
catch any weird edge cases.  What if, post removal, knownChildren.size() == 0 in 
the above code?  Having knownChildren empty and !isDirty runs the risk of 
reporting a false queue-empty status when in fact we just need to pull more 
nodes from ZK.
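
A hedged sketch of that suggestion, reusing the variables from the quoted 
snippet (illustrative only):

{code}
// After a bulk remove, treat an empty (or inconsistent) cache as dirty so we
// re-read from ZK instead of reporting a possibly false "queue empty".
int cacheSizeBefore = knownChildren.size();
knownChildren.removeAll(paths);
if (cacheSizeBefore - paths.size() == knownChildren.size() && !knownChildren.isEmpty()) {
  stats.setQueueLength(knownChildren.size());
} else {
  knownChildren.clear();
  isDirty = true;
}
{code}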

> Remove the usage of workqueue for Overseer
> --
>
> Key: SOLR-11443
> URL: https://issues.apache.org/jira/browse/SOLR-11443
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
> Attachments: SOLR-11443.patch, SOLR-11443.patch, SOLR-11443.patch
>
>
> If we can remove the usage of workqueue, We can save a lot of IO blocking in 
> Overseer, hence boost performance a lot.






[jira] [Commented] (SOLR-11443) Remove the usage of workqueue for Overseer

2017-10-13 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203992#comment-16203992
 ] 

Scott Blum commented on SOLR-11443:
---

BTW, have you tried out github PRs?  It would be so much easier to review in 
that tool. :)

> Remove the usage of workqueue for Overseer
> --
>
> Key: SOLR-11443
> URL: https://issues.apache.org/jira/browse/SOLR-11443
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
> Attachments: SOLR-11443.patch, SOLR-11443.patch, SOLR-11443.patch
>
>
> If we can remove the usage of workqueue, We can save a lot of IO blocking in 
> Overseer, hence boost performance a lot.






[jira] [Commented] (SOLR-11443) Remove the usage of workqueue for Overseer

2017-10-11 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16201129#comment-16201129
 ] 

Scott Blum commented on SOLR-11443:
---

Should probably deprecate and update all the doc on the legacy workQueue to 
note that it's only to support the previous version since we don't add anything 
to it anymore.

> Remove the usage of workqueue for Overseer
> --
>
> Key: SOLR-11443
> URL: https://issues.apache.org/jira/browse/SOLR-11443
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
> Attachments: SOLR-11443.patch, SOLR-11443.patch
>
>
> If we can remove the usage of workqueue, We can save a lot of IO blocking in 
> Overseer, hence boost performance a lot.






[jira] [Commented] (SOLR-11443) Remove the usage of workqueue for Overseer

2017-10-11 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16201095#comment-16201095
 ] 

Scott Blum commented on SOLR-11443:
---

Can you talk me through:

{code}
if (knownChildren.containsAll(paths)) {
  knownChildren.removeAll(paths);
  stats.setQueueLength(knownChildren.size());
} else {
  knownChildren.clear();
  isDirty = true;
}
{code}

Seems like you could just always set this dirty; but if you're trying to do 
in-memory surgery as an optimization, I don't understand the need for the 
containsAll check.

> Remove the usage of workqueue for Overseer
> --
>
> Key: SOLR-11443
> URL: https://issues.apache.org/jira/browse/SOLR-11443
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
> Attachments: SOLR-11443.patch, SOLR-11443.patch
>
>
> If we can remove the usage of workqueue, We can save a lot of IO blocking in 
> Overseer, hence boost performance a lot.






[jira] [Commented] (SOLR-11443) Remove the usage of workqueue for Overseer

2017-10-11 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16201120#comment-16201120
 ] 

Scott Blum commented on SOLR-11443:
---

Just a few nits / questions, otherwise LGTM.  Super great performance 
improvement and simplification.

> Remove the usage of workqueue for Overseer
> --
>
> Key: SOLR-11443
> URL: https://issues.apache.org/jira/browse/SOLR-11443
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
> Attachments: SOLR-11443.patch, SOLR-11443.patch
>
>
> If we can remove the usage of workqueue, We can save a lot of IO blocking in 
> Overseer, hence boost performance a lot.






[jira] [Comment Edited] (SOLR-11443) Remove the usage of workqueue for Overseer

2017-10-11 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16201107#comment-16201107
 ] 

Scott Blum edited comment on SOLR-11443 at 10/11/17 10:47 PM:
--

it might be more clear to do the `processedNodes.add(head.first());` _after_ 
calling processQueueItem, maybe conditionally based on whether onWriteAfter was 
called.  IE:

{code}
Set<String> processedNodes = new HashSet<>();
String[] curNode = new String[1];
while (!queue.isEmpty()) {
  for (Pair<String, byte[]> head : queue) {
    curNode[0] = head.first();
    byte[] data = head.second();
    final ZkNodeProps message = ZkNodeProps.load(data);
    log.debug("processMessage: queueSize: {}, message = {} current state version: {}",
        stateUpdateQueue.getStats().getQueueLength(), message,
        clusterState.getZkClusterStateVersion());
    // The callback will always be called on this thread
    clusterState = processQueueItem(message, clusterState, zkStateWriter, true,
        new ZkStateWriter.ZkWriteCallback() {
          @Override
          public void onWriteBefore() throws Exception {
            stateUpdateQueue.remove(processedNodes);
            processedNodes.clear();
          }

          @Override
          public void onWriteAfter() throws Exception {
            processedNodes.add(curNode[0]);
            stateUpdateQueue.remove(processedNodes);
            processedNodes.clear();
            curNode[0] = null;
          }
        });
  }
  if (curNode[0] != null) {
    // e.g. onWriteAfter was not called
    processedNodes.add(curNode[0]);
  }
  queue = new LinkedList<>(stateUpdateQueue.peekElements(1000, 100,
      node -> !processedNodes.contains(node)));
}
{code}


was (Author: dragonsinth):
it might be more clear to do the `processedNodes.add(head.first());` _after_ 
calling processQueueItem, maybe conditionally based on whether onWriteAfter was 
called.

> Remove the usage of workqueue for Overseer
> --
>
> Key: SOLR-11443
> URL: https://issues.apache.org/jira/browse/SOLR-11443
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
> Attachments: SOLR-11443.patch, SOLR-11443.patch
>
>
> If we can remove the usage of workqueue, We can save a lot of IO blocking in 
> Overseer, hence boost performance a lot.






[jira] [Commented] (SOLR-11443) Remove the usage of workqueue for Overseer

2017-10-11 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16201107#comment-16201107
 ] 

Scott Blum commented on SOLR-11443:
---

it might be more clear to do the `processedNodes.add(head.first());` _after_ 
calling processQueueItem, maybe conditionally based on whether onWriteAfter was 
called.

> Remove the usage of workqueue for Overseer
> --
>
> Key: SOLR-11443
> URL: https://issues.apache.org/jira/browse/SOLR-11443
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
> Attachments: SOLR-11443.patch, SOLR-11443.patch
>
>
> If we can remove the usage of workqueue, We can save a lot of IO blocking in 
> Overseer, hence boost performance a lot.






[jira] [Commented] (SOLR-11443) Remove the usage of workqueue for Overseer

2017-10-11 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16201081#comment-16201081
 ] 

Scott Blum commented on SOLR-11443:
---

When would numUpdates diverge from updates.size()?

> Remove the usage of workqueue for Overseer
> --
>
> Key: SOLR-11443
> URL: https://issues.apache.org/jira/browse/SOLR-11443
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
> Attachments: SOLR-11443.patch, SOLR-11443.patch
>
>
> If we can remove the usage of workqueue, We can save a lot of IO blocking in 
> Overseer, hence boost performance a lot.






[jira] [Commented] (SOLR-11443) Remove the usage of workqueue for Overseer

2017-10-11 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16201074#comment-16201074
 ] 

Scott Blum commented on SOLR-11443:
---

I feel like the maybeFlushBefore, maybeFlushAfter bits need a little more 
thinking.  Seems pretty arbitrary to only check the firstCommand; maybe we 
should completely separate command-specific flush trigger from general purpose 
flush trigger?  Then you could check command-level flushing on each command, if 
that's even still necessary.

> Remove the usage of workqueue for Overseer
> --
>
> Key: SOLR-11443
> URL: https://issues.apache.org/jira/browse/SOLR-11443
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
> Attachments: SOLR-11443.patch, SOLR-11443.patch
>
>
> If we can remove the usage of workqueue, We can save a lot of IO blocking in 
> Overseer, hence boost performance a lot.






[jira] [Commented] (SOLR-11443) Remove the usage of workqueue for Overseer

2017-10-11 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16201064#comment-16201064
 ] 

Scott Blum commented on SOLR-11443:
---

I might be thinking about this wrong, but the test seems to be trying to thread 
an invisible needle. I guess we're trying to shut down the overseer halfway 
through the list of updates?  But we might very well just complete all the 
operations quickly and restart the overseer after they're all done.

> Remove the usage of workqueue for Overseer
> --
>
> Key: SOLR-11443
> URL: https://issues.apache.org/jira/browse/SOLR-11443
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
> Attachments: SOLR-11443.patch, SOLR-11443.patch
>
>
> If we can remove the usage of workqueue, We can save a lot of IO blocking in 
> Overseer, hence boost performance a lot.






[jira] [Commented] (SOLR-11443) Remove the usage of workqueue for Overseer

2017-10-11 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16200881#comment-16200881
 ] 

Scott Blum commented on SOLR-11443:
---

Love the idea; having a separate queue-work never made much sense to me.  I can 
look at the patch in a bit.

> Remove the usage of workqueue for Overseer
> --
>
> Key: SOLR-11443
> URL: https://issues.apache.org/jira/browse/SOLR-11443
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
> Attachments: SOLR-11443.patch, SOLR-11443.patch
>
>
> If we can remove the usage of workqueue, We can save a lot of IO blocking in 
> Overseer, hence boost performance a lot.






[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect

2017-10-06 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194900#comment-16194900
 ] 

Scott Blum commented on SOLR-11423:
---

Perhaps you're right.  In our cluster though, Overseer is not able to chew 
through 20k entries very fast; it takes a long time (many minutes) to get 
through that number of items.  Our current escape valve is a tiny separate tool 
that literally just goes through the queue and deletes everything. :D

> Overseer queue needs a hard cap (maximum size) that clients respect
> ---
>
> Key: SOLR-11423
> URL: https://issues.apache.org/jira/browse/SOLR-11423
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Scott Blum
>Assignee: Scott Blum
>
> When Solr gets into pathological GC thrashing states, it can fill the 
> overseer queue with literally thousands and thousands of queued state 
> changes.  Many of these end up being duplicated up/down state updates.  Our 
> production cluster has gotten to the 100k queued items level many times, and 
> there's nothing useful you can do at this point except manually purge the 
> queue in ZK.  Recently, it hit 3 million queued items, at which point our 
> entire ZK cluster exploded.
> I propose a hard cap.  Any client trying to enqueue an item when a queue is 
> full would throw an exception.  I was thinking maybe 10,000 items would be a 
> reasonable limit.  Thoughts?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect

2017-10-05 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193435#comment-16193435
 ] 

Scott Blum commented on SOLR-11423:
---

[~noble.paul] both good questions!

> This looks fine. I'm just worried about the extra cost of 
> {{Stat stat = zookeeper.exists(dir, null, true);}} for each call.

Indeed!  That's why, in my second pass, I added {{offerPermits}}.  As long as the 
queue is mostly empty, each client will only bother checking the stat about 
every 200 queued entries.  Agreed on a "create sequential node if parent node 
has fewer than XXX entries".
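
For illustration, roughly the shape of that throttled check (a sketch under 
assumed names and numbers, not the actual patch):

{code:java}
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

/** Sketch only: throttled queue-size check performed at the top of offer(). */
class ThrottledSizeCheck {
  private static final int MAX_QUEUE_SIZE = 20000; // illustrative hard cap

  private final ZooKeeper zookeeper;
  private final String dir;     // e.g. "/overseer/queue"
  private int offerPermits = 0; // offers allowed before the next Stat check

  ThrottledSizeCheck(ZooKeeper zookeeper, String dir) {
    this.zookeeper = zookeeper;
    this.dir = dir;
  }

  void checkBeforeOffer() throws Exception {
    if (offerPermits-- > 0) {
      return; // skip the ZK round trip most of the time
    }
    Stat stat = zookeeper.exists(dir, false); // one getNumChildren() probe
    int numChildren = (stat == null) ? 0 : stat.getNumChildren();
    if (numChildren >= MAX_QUEUE_SIZE) {
      throw new IllegalStateException(dir + " is full: " + numChildren + " items");
    }
    // The emptier the queue, the longer we can go before checking again;
    // with an empty queue and a 20k cap this works out to roughly 200 offers.
    offerPermits = Math.max(1, (MAX_QUEUE_SIZE - numChildren) / 100);
  }
}
{code}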

> Another point to consider: is the rest of the Solr code designed to handle 
> this error? I guess the thread should wait for a few seconds and retry once 
> the number of items falls below the limit. This will dramatically reduce the 
> number of other errors in the system.

I thought about this point quite a bit, and came down on the side of erroring 
immediately.  Again, I'm thinking of this mostly like an automatic emergency 
shutoff in a nuclear reactor: you hope you never need it.  The point is, if 
you're in a state where you have 20k items in the queue, you're already in a 
pathologically bad state.  I can't see how adding latency and hoping things get 
better would improve the situation vs. erroring out immediately.  I've never 
seen a Solr cluster recover on its own once the queue got that high; it has 
always required manual intervention.

> Overseer queue needs a hard cap (maximum size) that clients respect
> ---
>
> Key: SOLR-11423
> URL: https://issues.apache.org/jira/browse/SOLR-11423
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Scott Blum
>Assignee: Scott Blum
>
> When Solr gets into pathological GC thrashing states, it can fill the 
> overseer queue with literally thousands and thousands of queued state 
> changes.  Many of these end up being duplicated up/down state updates.  Our 
> production cluster has gotten to the 100k queued items level many times, and 
> there's nothing useful you can do at this point except manually purge the 
> queue in ZK.  Recently, it hit 3 million queued items, at which point our 
> entire ZK cluster exploded.
> I propose a hard cap.  Any client trying to enqueue an item when a queue is 
> full would throw an exception.  I was thinking maybe 10,000 items would be a 
> reasonable limit.  Thoughts?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect

2017-10-04 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16191962#comment-16191962
 ] 

Scott Blum commented on SOLR-11423:
---

[~shalinmangar] [~noble.paul] either of you want to take a look at this change 
and +1 or -1?

> Overseer queue needs a hard cap (maximum size) that clients respect
> ---
>
> Key: SOLR-11423
> URL: https://issues.apache.org/jira/browse/SOLR-11423
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Scott Blum
>Assignee: Scott Blum
>
> When Solr gets into pathological GC thrashing states, it can fill the 
> overseer queue with literally thousands and thousands of queued state 
> changes.  Many of these end up being duplicated up/down state updates.  Our 
> production cluster has gotten to the 100k queued items level many times, and 
> there's nothing useful you can do at this point except manually purge the 
> queue in ZK.  Recently, it hit 3 million queued items, at which point our 
> entire ZK cluster exploded.
> I propose a hard cap.  Any client trying to enqueue an item when a queue is 
> full would throw an exception.  I was thinking maybe 10,000 items would be a 
> reasonable limit.  Thoughts?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect

2017-10-02 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1617#comment-1617
 ] 

Scott Blum commented on SOLR-11423:
---

Patch passes tests for me.

> Overseer queue needs a hard cap (maximum size) that clients respect
> ---
>
> Key: SOLR-11423
> URL: https://issues.apache.org/jira/browse/SOLR-11423
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Scott Blum
>Assignee: Scott Blum
>
> When Solr gets into pathological GC thrashing states, it can fill the 
> overseer queue with literally thousands and thousands of queued state 
> changes.  Many of these end up being duplicated up/down state updates.  Our 
> production cluster has gotten to the 100k queued items level many times, and 
> there's nothing useful you can do at this point except manually purge the 
> queue in ZK.  Recently, it hit 3 million queued items, at which point our 
> entire ZK cluster exploded.
> I propose a hard cap.  Any client trying to enqueue an item when a queue is 
> full would throw an exception.  I was thinking maybe 10,000 items would be a 
> reasonable limit.  Thoughts?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect

2017-10-02 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16188867#comment-16188867
 ] 

Scott Blum commented on SOLR-11423:
---

[~noble.paul] I went with 20k.  We could make it configurable, but again, I 
intend this to be a general-purpose safety valve to protect ZooKeeper from 
exploding, which makes me think we could pick a universal value.

> Overseer queue needs a hard cap (maximum size) that clients respect
> ---
>
> Key: SOLR-11423
> URL: https://issues.apache.org/jira/browse/SOLR-11423
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Scott Blum
>Assignee: Scott Blum
>
> When Solr gets into pathological GC thrashing states, it can fill the 
> overseer queue with literally thousands and thousands of queued state 
> changes.  Many of these end up being duplicated up/down state updates.  Our 
> production cluster has gotten to the 100k queued items level many times, and 
> there's nothing useful you can do at this point except manually purge the 
> queue in ZK.  Recently, it hit 3 million queued items, at which point our 
> entire ZK cluster exploded.
> I propose a hard cap.  Any client trying to enqueue an item when a queue is 
> full would throw an exception.  I was thinking maybe 10,000 items would be a 
> reasonable limit.  Thoughts?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect

2017-10-02 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16188803#comment-16188803
 ] 

Scott Blum commented on SOLR-11423:
---

[~erickerickson] I have backported changes to reduce overseer task counts.  
This isn't an issue during normal operation.  Think of this more as an automatic 
safety shutoff on a nuclear reactor.

> Overseer queue needs a hard cap (maximum size) that clients respect
> ---
>
> Key: SOLR-11423
> URL: https://issues.apache.org/jira/browse/SOLR-11423
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Scott Blum
>Assignee: Scott Blum
>
> When Solr gets into pathological GC thrashing states, it can fill the 
> overseer queue with literally thousands and thousands of queued state 
> changes.  Many of these end up being duplicated up/down state updates.  Our 
> production cluster has gotten to the 100k queued items level many times, and 
> there's nothing useful you can do at this point except manually purge the 
> queue in ZK.  Recently, it hit 3 million queued items, at which point our 
> entire ZK cluster exploded.
> I propose a hard cap.  Any client trying to enqueue an item when a queue is 
> full would throw an exception.  I was thinking maybe 10,000 items would be a 
> reasonable limit.  Thoughts?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect

2017-09-29 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186282#comment-16186282
 ] 

Scott Blum commented on SOLR-11423:
---

CC: [~jhump] [~noble.paul] [~shalinmangar]

> Overseer queue needs a hard cap (maximum size) that clients respect
> ---
>
> Key: SOLR-11423
> URL: https://issues.apache.org/jira/browse/SOLR-11423
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Scott Blum
>Assignee: Scott Blum
>
> When Solr gets into pathological GC thrashing states, it can fill the 
> overseer queue with literally thousands and thousands of queued state 
> changes.  Many of these end up being duplicated up/down state updates.  Our 
> production cluster has gotten to the 100k queued items level many times, and 
> there's nothing useful you can do at this point except manually purge the 
> queue in ZK.  Recently, it hit 3 million queued items, at which point our 
> entire ZK cluster exploded.
> I propose a hard cap.  Any client trying to enqueue an item when a queue is 
> full would throw an exception.  I was thinking maybe 10,000 items would be a 
> reasonable limit.  Thoughts?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect

2017-09-29 Thread Scott Blum (JIRA)
Scott Blum created SOLR-11423:
-

 Summary: Overseer queue needs a hard cap (maximum size) that 
clients respect
 Key: SOLR-11423
 URL: https://issues.apache.org/jira/browse/SOLR-11423
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
  Components: SolrCloud
Reporter: Scott Blum
Assignee: Scott Blum


When Solr gets into pathological GC thrashing states, it can fill the overseer 
queue with literally thousands and thousands of queued state changes.  Many of 
these end up being duplicated up/down state updates.  Our production cluster 
has gotten to the 100k queued items level many times, and there's nothing 
useful you can do at this point except manually purge the queue in ZK.  
Recently, it hit 3 million queued items, at which point our entire ZK cluster 
exploded.

I propose a hard cap.  Any client trying to enqueue an item when a queue is full 
would throw an exception.  I was thinking maybe 10,000 items would be a 
reasonable limit.  Thoughts?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11025) OverseerTest.testShardLeaderChange() failures

2017-07-07 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16078603#comment-16078603
 ] 

Scott Blum commented on SOLR-11025:
---

Thanks, guys!  I didn't bother adding an entry to CHANGES.txt; it didn't seem 
relevant for user-facing purposes.

> OverseerTest.testShardLeaderChange() failures
> -
>
> Key: SOLR-11025
> URL: https://issues.apache.org/jira/browse/SOLR-11025
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Steve Rowe
> Attachments: SOLR-11025.patch
>
>
> Non-reproducing Jenkins failure - this test hasn't failed in Jenkins in 
> months, and suddenly several failures within days of each other:
> [https://jenkins.thetaphi.de/job/Lucene-Solr-7.x-Linux/18/]:
> {noformat}
> Checking out Revision 986175915927ee2bbd971340f858601c86b3c676 
> (refs/remotes/origin/branch_7x)
> [...]
>[junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=OverseerTest 
> -Dtests.method=testShardLeaderChange -Dtests.seed=995C82D4739EF7D8 
> -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=ru 
> -Dtests.timezone=Asia/Calcutta -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8
>[junit4] FAILURE  223s J2 | OverseerTest.testShardLeaderChange <<<
>[junit4]> Throwable #1: java.lang.AssertionError: Unexpected shard 
> leader coll:collection1 shard:shard1 expected: but was:
>[junit4]>  at 
> __randomizedtesting.SeedInfo.seed([995C82D4739EF7D8:470F052369060229]:0)
>[junit4]>  at 
> org.apache.solr.cloud.OverseerTest.verifyShardLeader(OverseerTest.java:486)
>[junit4]>  at 
> org.apache.solr.cloud.OverseerTest.testShardLeaderChange(OverseerTest.java:720)
> [...]
>[junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): {}, 
> docValues:{}, maxPointsInLeafNode=892, maxMBSortInHeap=5.965872045375053, 
> sim=RandomSimilarity(queryNorm=false): {}, locale=ru, timezone=Asia/Calcutta
>[junit4]   2> NOTE: Linux 4.10.0-21-generic i386/Oracle Corporation 
> 1.8.0_131 (32-bit)/cpus=8,threads=1,free=96275160,total=293539840
>[junit4]   2> NOTE: All tests run in this JVM: [HardAutoCommitTest, 
> SimplePostToolTest, TestSolrCoreSnapshots, ReplicationFactorTest, 
> TestSolrCoreProperties, TestSha256AuthenticationProvider, 
> TestExpandComponent, TestCloudRecovery, ConfigureRecoveryStrategyTest, 
> TimeZoneUtilsTest, TestTolerantUpdateProcessorRandomCloud, 
> TestExceedMaxTermLength, PrimitiveFieldTypeTest, SolrInfoBeanTest, 
> FullSolrCloudDistribCmdsTest, UniqFieldsUpdateProcessorFactoryTest, 
> TestReplicaProperties, SolrCloudReportersTest, UnloadDistributedZkTest, 
> CleanupOldIndexTest, LeaderInitiatedRecoveryOnShardRestartTest, 
> FastVectorHighlighterTest, TestOrdValues, MinimalSchemaTest, 
> TestSubQueryTransformerCrossCore, ParsingFieldUpdateProcessorsTest, 
> BufferStoreTest, TestLegacyFieldReuse, CollectionsAPISolrJTest, 
> TestSchemalessBufferedUpdates, ConnectionManagerTest, MetricsConfigTest, 
> TestPayloadScoreQParserPlugin, TermVectorComponentDistributedTest, 
> TestLMJelinekMercerSimilarityFactory, TestFastWriter, MultiTermTest, 
> HdfsBasicDistributedZk2Test, V2StandaloneTest, DocumentBuilderTest, 
> TestMultiValuedNumericRangeQuery, AnalyticsMergeStrategyTest, 
> TestNumericTokenStream, TestFieldCacheSort, 
> TestSolrCloudWithSecureImpersonation, TestBlendedInfixSuggestions, 
> ResponseLogComponentTest, CopyFieldTest, TestAuthenticationFramework, 
> BlockJoinFacetDistribTest, TestFieldSortValues, TestJmxIntegration, 
> SolrTestCaseJ4Test, BasicAuthIntegrationTest, URLClassifyProcessorTest, 
> DateFieldTest, TestExactSharedStatsCache, TestFieldTypeCollectionResource, 
> ExplicitHLLTest, ConjunctionSolrSpellCheckerTest, 
> TestLeaderElectionWithEmptyReplica, TestReloadAndDeleteDocs, 
> ClusterStateTest, TestSQLHandler, HdfsRecoverLeaseTest, QueryEqualityTest, 
> UUIDUpdateProcessorFallbackTest, ClassificationUpdateProcessorFactoryTest, 
> DistributedSuggestComponentTest, TestHalfAndHalfDocValues, 
> ShowFileRequestHandlerTest, ExitableDirectoryReaderTest, 
> TestInfoStreamLogging, TestLocalFSCloudBackupRestore, 
> ChaosMonkeyNothingIsSafeWithPullReplicasTest, TestSort, NumericFieldsTest, 
> DirectUpdateHandlerTest, SuggesterFSTTest, NodeMutatorTest, 
> DateMathParserTest, DistribCursorPagingTest, CircularListTest, 
> CloneFieldUpdateProcessorFactoryTest, 
> OverseerCollectionConfigSetProcessorTest, DateRangeFieldTest, 
> TestDFRSimilarityFactory, XsltUpdateRequestHandlerTest, TestConfigSetsAPI, 
> CoreMergeIndexesAdminHandlerTest, TestSerializedLuceneMatchVersion, 
> TestPrepRecovery, TestNoOpRegenerator, DistributedFacetPivotWhiteBoxTest, 
> LeaderFailoverAfterPartitionTest, TestSolrFieldCacheBean, OverseerTest]
> {noformat}
> Following is the first of 4 failures from my 

[jira] [Assigned] (SOLR-11025) OverseerTest.testShardLeaderChange() failures

2017-07-07 Thread Scott Blum (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Blum reassigned SOLR-11025:
-

Assignee: Scott Blum

> OverseerTest.testShardLeaderChange() failures
> -
>
> Key: SOLR-11025
> URL: https://issues.apache.org/jira/browse/SOLR-11025
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Steve Rowe
>Assignee: Scott Blum
> Attachments: SOLR-11025.patch
>
>
> Non-reproducing Jenkins failure - this test hasn't failed in Jenkins in 
> months, and suddenly several failures within days of each other:
> [https://jenkins.thetaphi.de/job/Lucene-Solr-7.x-Linux/18/]:
> {noformat}
> Checking out Revision 986175915927ee2bbd971340f858601c86b3c676 
> (refs/remotes/origin/branch_7x)
> [...]
>[junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=OverseerTest 
> -Dtests.method=testShardLeaderChange -Dtests.seed=995C82D4739EF7D8 
> -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=ru 
> -Dtests.timezone=Asia/Calcutta -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8
>[junit4] FAILURE  223s J2 | OverseerTest.testShardLeaderChange <<<
>[junit4]> Throwable #1: java.lang.AssertionError: Unexpected shard 
> leader coll:collection1 shard:shard1 expected: but was:
>[junit4]>  at 
> __randomizedtesting.SeedInfo.seed([995C82D4739EF7D8:470F052369060229]:0)
>[junit4]>  at 
> org.apache.solr.cloud.OverseerTest.verifyShardLeader(OverseerTest.java:486)
>[junit4]>  at 
> org.apache.solr.cloud.OverseerTest.testShardLeaderChange(OverseerTest.java:720)
> [...]
>[junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): {}, 
> docValues:{}, maxPointsInLeafNode=892, maxMBSortInHeap=5.965872045375053, 
> sim=RandomSimilarity(queryNorm=false): {}, locale=ru, timezone=Asia/Calcutta
>[junit4]   2> NOTE: Linux 4.10.0-21-generic i386/Oracle Corporation 
> 1.8.0_131 (32-bit)/cpus=8,threads=1,free=96275160,total=293539840
>[junit4]   2> NOTE: All tests run in this JVM: [HardAutoCommitTest, 
> SimplePostToolTest, TestSolrCoreSnapshots, ReplicationFactorTest, 
> TestSolrCoreProperties, TestSha256AuthenticationProvider, 
> TestExpandComponent, TestCloudRecovery, ConfigureRecoveryStrategyTest, 
> TimeZoneUtilsTest, TestTolerantUpdateProcessorRandomCloud, 
> TestExceedMaxTermLength, PrimitiveFieldTypeTest, SolrInfoBeanTest, 
> FullSolrCloudDistribCmdsTest, UniqFieldsUpdateProcessorFactoryTest, 
> TestReplicaProperties, SolrCloudReportersTest, UnloadDistributedZkTest, 
> CleanupOldIndexTest, LeaderInitiatedRecoveryOnShardRestartTest, 
> FastVectorHighlighterTest, TestOrdValues, MinimalSchemaTest, 
> TestSubQueryTransformerCrossCore, ParsingFieldUpdateProcessorsTest, 
> BufferStoreTest, TestLegacyFieldReuse, CollectionsAPISolrJTest, 
> TestSchemalessBufferedUpdates, ConnectionManagerTest, MetricsConfigTest, 
> TestPayloadScoreQParserPlugin, TermVectorComponentDistributedTest, 
> TestLMJelinekMercerSimilarityFactory, TestFastWriter, MultiTermTest, 
> HdfsBasicDistributedZk2Test, V2StandaloneTest, DocumentBuilderTest, 
> TestMultiValuedNumericRangeQuery, AnalyticsMergeStrategyTest, 
> TestNumericTokenStream, TestFieldCacheSort, 
> TestSolrCloudWithSecureImpersonation, TestBlendedInfixSuggestions, 
> ResponseLogComponentTest, CopyFieldTest, TestAuthenticationFramework, 
> BlockJoinFacetDistribTest, TestFieldSortValues, TestJmxIntegration, 
> SolrTestCaseJ4Test, BasicAuthIntegrationTest, URLClassifyProcessorTest, 
> DateFieldTest, TestExactSharedStatsCache, TestFieldTypeCollectionResource, 
> ExplicitHLLTest, ConjunctionSolrSpellCheckerTest, 
> TestLeaderElectionWithEmptyReplica, TestReloadAndDeleteDocs, 
> ClusterStateTest, TestSQLHandler, HdfsRecoverLeaseTest, QueryEqualityTest, 
> UUIDUpdateProcessorFallbackTest, ClassificationUpdateProcessorFactoryTest, 
> DistributedSuggestComponentTest, TestHalfAndHalfDocValues, 
> ShowFileRequestHandlerTest, ExitableDirectoryReaderTest, 
> TestInfoStreamLogging, TestLocalFSCloudBackupRestore, 
> ChaosMonkeyNothingIsSafeWithPullReplicasTest, TestSort, NumericFieldsTest, 
> DirectUpdateHandlerTest, SuggesterFSTTest, NodeMutatorTest, 
> DateMathParserTest, DistribCursorPagingTest, CircularListTest, 
> CloneFieldUpdateProcessorFactoryTest, 
> OverseerCollectionConfigSetProcessorTest, DateRangeFieldTest, 
> TestDFRSimilarityFactory, XsltUpdateRequestHandlerTest, TestConfigSetsAPI, 
> CoreMergeIndexesAdminHandlerTest, TestSerializedLuceneMatchVersion, 
> TestPrepRecovery, TestNoOpRegenerator, DistributedFacetPivotWhiteBoxTest, 
> LeaderFailoverAfterPartitionTest, TestSolrFieldCacheBean, OverseerTest]
> {noformat}
> Following is the first of 4 failures from my Jenkins in the last 24 hours or 
> so on branch_7_0 and branch_7x:
> {noformat}
> 

[jira] [Resolved] (SOLR-11025) OverseerTest.testShardLeaderChange() failures

2017-07-07 Thread Scott Blum (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Blum resolved SOLR-11025.
---
Resolution: Fixed

> OverseerTest.testShardLeaderChange() failures
> -
>
> Key: SOLR-11025
> URL: https://issues.apache.org/jira/browse/SOLR-11025
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Steve Rowe
> Attachments: SOLR-11025.patch
>
>
> Non-reproducing Jenkins failure - this test hasn't failed in Jenkins in 
> months, and suddenly several failures within days of each other:
> [https://jenkins.thetaphi.de/job/Lucene-Solr-7.x-Linux/18/]:
> {noformat}
> Checking out Revision 986175915927ee2bbd971340f858601c86b3c676 
> (refs/remotes/origin/branch_7x)
> [...]
>[junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=OverseerTest 
> -Dtests.method=testShardLeaderChange -Dtests.seed=995C82D4739EF7D8 
> -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=ru 
> -Dtests.timezone=Asia/Calcutta -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8
>[junit4] FAILURE  223s J2 | OverseerTest.testShardLeaderChange <<<
>[junit4]> Throwable #1: java.lang.AssertionError: Unexpected shard 
> leader coll:collection1 shard:shard1 expected: but was:
>[junit4]>  at 
> __randomizedtesting.SeedInfo.seed([995C82D4739EF7D8:470F052369060229]:0)
>[junit4]>  at 
> org.apache.solr.cloud.OverseerTest.verifyShardLeader(OverseerTest.java:486)
>[junit4]>  at 
> org.apache.solr.cloud.OverseerTest.testShardLeaderChange(OverseerTest.java:720)
> [...]
>[junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): {}, 
> docValues:{}, maxPointsInLeafNode=892, maxMBSortInHeap=5.965872045375053, 
> sim=RandomSimilarity(queryNorm=false): {}, locale=ru, timezone=Asia/Calcutta
>[junit4]   2> NOTE: Linux 4.10.0-21-generic i386/Oracle Corporation 
> 1.8.0_131 (32-bit)/cpus=8,threads=1,free=96275160,total=293539840
>[junit4]   2> NOTE: All tests run in this JVM: [HardAutoCommitTest, 
> SimplePostToolTest, TestSolrCoreSnapshots, ReplicationFactorTest, 
> TestSolrCoreProperties, TestSha256AuthenticationProvider, 
> TestExpandComponent, TestCloudRecovery, ConfigureRecoveryStrategyTest, 
> TimeZoneUtilsTest, TestTolerantUpdateProcessorRandomCloud, 
> TestExceedMaxTermLength, PrimitiveFieldTypeTest, SolrInfoBeanTest, 
> FullSolrCloudDistribCmdsTest, UniqFieldsUpdateProcessorFactoryTest, 
> TestReplicaProperties, SolrCloudReportersTest, UnloadDistributedZkTest, 
> CleanupOldIndexTest, LeaderInitiatedRecoveryOnShardRestartTest, 
> FastVectorHighlighterTest, TestOrdValues, MinimalSchemaTest, 
> TestSubQueryTransformerCrossCore, ParsingFieldUpdateProcessorsTest, 
> BufferStoreTest, TestLegacyFieldReuse, CollectionsAPISolrJTest, 
> TestSchemalessBufferedUpdates, ConnectionManagerTest, MetricsConfigTest, 
> TestPayloadScoreQParserPlugin, TermVectorComponentDistributedTest, 
> TestLMJelinekMercerSimilarityFactory, TestFastWriter, MultiTermTest, 
> HdfsBasicDistributedZk2Test, V2StandaloneTest, DocumentBuilderTest, 
> TestMultiValuedNumericRangeQuery, AnalyticsMergeStrategyTest, 
> TestNumericTokenStream, TestFieldCacheSort, 
> TestSolrCloudWithSecureImpersonation, TestBlendedInfixSuggestions, 
> ResponseLogComponentTest, CopyFieldTest, TestAuthenticationFramework, 
> BlockJoinFacetDistribTest, TestFieldSortValues, TestJmxIntegration, 
> SolrTestCaseJ4Test, BasicAuthIntegrationTest, URLClassifyProcessorTest, 
> DateFieldTest, TestExactSharedStatsCache, TestFieldTypeCollectionResource, 
> ExplicitHLLTest, ConjunctionSolrSpellCheckerTest, 
> TestLeaderElectionWithEmptyReplica, TestReloadAndDeleteDocs, 
> ClusterStateTest, TestSQLHandler, HdfsRecoverLeaseTest, QueryEqualityTest, 
> UUIDUpdateProcessorFallbackTest, ClassificationUpdateProcessorFactoryTest, 
> DistributedSuggestComponentTest, TestHalfAndHalfDocValues, 
> ShowFileRequestHandlerTest, ExitableDirectoryReaderTest, 
> TestInfoStreamLogging, TestLocalFSCloudBackupRestore, 
> ChaosMonkeyNothingIsSafeWithPullReplicasTest, TestSort, NumericFieldsTest, 
> DirectUpdateHandlerTest, SuggesterFSTTest, NodeMutatorTest, 
> DateMathParserTest, DistribCursorPagingTest, CircularListTest, 
> CloneFieldUpdateProcessorFactoryTest, 
> OverseerCollectionConfigSetProcessorTest, DateRangeFieldTest, 
> TestDFRSimilarityFactory, XsltUpdateRequestHandlerTest, TestConfigSetsAPI, 
> CoreMergeIndexesAdminHandlerTest, TestSerializedLuceneMatchVersion, 
> TestPrepRecovery, TestNoOpRegenerator, DistributedFacetPivotWhiteBoxTest, 
> LeaderFailoverAfterPartitionTest, TestSolrFieldCacheBean, OverseerTest]
> {noformat}
> Following is the first of 4 failures from my Jenkins in the last 24 hours or 
> so on branch_7_0 and branch_7x:
> {noformat}
> Checking out Revision 

[jira] [Commented] (SOLR-11025) OverseerTest.testShardLeaderChange() failures

2017-07-07 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16078506#comment-16078506
 ] 

Scott Blum commented on SOLR-11025:
---

You da bomb [~steve_rowe]

> OverseerTest.testShardLeaderChange() failures
> -
>
> Key: SOLR-11025
> URL: https://issues.apache.org/jira/browse/SOLR-11025
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Steve Rowe
> Attachments: SOLR-11025.patch
>
>
> Non-reproducing Jenkins failure - this test hasn't failed in Jenkins in 
> months, and suddenly several failures within days of each other:
> [https://jenkins.thetaphi.de/job/Lucene-Solr-7.x-Linux/18/]:
> {noformat}
> Checking out Revision 986175915927ee2bbd971340f858601c86b3c676 
> (refs/remotes/origin/branch_7x)
> [...]
>[junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=OverseerTest 
> -Dtests.method=testShardLeaderChange -Dtests.seed=995C82D4739EF7D8 
> -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=ru 
> -Dtests.timezone=Asia/Calcutta -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8
>[junit4] FAILURE  223s J2 | OverseerTest.testShardLeaderChange <<<
>[junit4]> Throwable #1: java.lang.AssertionError: Unexpected shard 
> leader coll:collection1 shard:shard1 expected: but was:
>[junit4]>  at 
> __randomizedtesting.SeedInfo.seed([995C82D4739EF7D8:470F052369060229]:0)
>[junit4]>  at 
> org.apache.solr.cloud.OverseerTest.verifyShardLeader(OverseerTest.java:486)
>[junit4]>  at 
> org.apache.solr.cloud.OverseerTest.testShardLeaderChange(OverseerTest.java:720)
> [...]
>[junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): {}, 
> docValues:{}, maxPointsInLeafNode=892, maxMBSortInHeap=5.965872045375053, 
> sim=RandomSimilarity(queryNorm=false): {}, locale=ru, timezone=Asia/Calcutta
>[junit4]   2> NOTE: Linux 4.10.0-21-generic i386/Oracle Corporation 
> 1.8.0_131 (32-bit)/cpus=8,threads=1,free=96275160,total=293539840
>[junit4]   2> NOTE: All tests run in this JVM: [HardAutoCommitTest, 
> SimplePostToolTest, TestSolrCoreSnapshots, ReplicationFactorTest, 
> TestSolrCoreProperties, TestSha256AuthenticationProvider, 
> TestExpandComponent, TestCloudRecovery, ConfigureRecoveryStrategyTest, 
> TimeZoneUtilsTest, TestTolerantUpdateProcessorRandomCloud, 
> TestExceedMaxTermLength, PrimitiveFieldTypeTest, SolrInfoBeanTest, 
> FullSolrCloudDistribCmdsTest, UniqFieldsUpdateProcessorFactoryTest, 
> TestReplicaProperties, SolrCloudReportersTest, UnloadDistributedZkTest, 
> CleanupOldIndexTest, LeaderInitiatedRecoveryOnShardRestartTest, 
> FastVectorHighlighterTest, TestOrdValues, MinimalSchemaTest, 
> TestSubQueryTransformerCrossCore, ParsingFieldUpdateProcessorsTest, 
> BufferStoreTest, TestLegacyFieldReuse, CollectionsAPISolrJTest, 
> TestSchemalessBufferedUpdates, ConnectionManagerTest, MetricsConfigTest, 
> TestPayloadScoreQParserPlugin, TermVectorComponentDistributedTest, 
> TestLMJelinekMercerSimilarityFactory, TestFastWriter, MultiTermTest, 
> HdfsBasicDistributedZk2Test, V2StandaloneTest, DocumentBuilderTest, 
> TestMultiValuedNumericRangeQuery, AnalyticsMergeStrategyTest, 
> TestNumericTokenStream, TestFieldCacheSort, 
> TestSolrCloudWithSecureImpersonation, TestBlendedInfixSuggestions, 
> ResponseLogComponentTest, CopyFieldTest, TestAuthenticationFramework, 
> BlockJoinFacetDistribTest, TestFieldSortValues, TestJmxIntegration, 
> SolrTestCaseJ4Test, BasicAuthIntegrationTest, URLClassifyProcessorTest, 
> DateFieldTest, TestExactSharedStatsCache, TestFieldTypeCollectionResource, 
> ExplicitHLLTest, ConjunctionSolrSpellCheckerTest, 
> TestLeaderElectionWithEmptyReplica, TestReloadAndDeleteDocs, 
> ClusterStateTest, TestSQLHandler, HdfsRecoverLeaseTest, QueryEqualityTest, 
> UUIDUpdateProcessorFallbackTest, ClassificationUpdateProcessorFactoryTest, 
> DistributedSuggestComponentTest, TestHalfAndHalfDocValues, 
> ShowFileRequestHandlerTest, ExitableDirectoryReaderTest, 
> TestInfoStreamLogging, TestLocalFSCloudBackupRestore, 
> ChaosMonkeyNothingIsSafeWithPullReplicasTest, TestSort, NumericFieldsTest, 
> DirectUpdateHandlerTest, SuggesterFSTTest, NodeMutatorTest, 
> DateMathParserTest, DistribCursorPagingTest, CircularListTest, 
> CloneFieldUpdateProcessorFactoryTest, 
> OverseerCollectionConfigSetProcessorTest, DateRangeFieldTest, 
> TestDFRSimilarityFactory, XsltUpdateRequestHandlerTest, TestConfigSetsAPI, 
> CoreMergeIndexesAdminHandlerTest, TestSerializedLuceneMatchVersion, 
> TestPrepRecovery, TestNoOpRegenerator, DistributedFacetPivotWhiteBoxTest, 
> LeaderFailoverAfterPartitionTest, TestSolrFieldCacheBean, OverseerTest]
> {noformat}
> Following is the first of 4 failures from my Jenkins in the last 24 hours or 
> so on branch_7_0 and branch_7x:
> {noformat}
> 

[jira] [Commented] (SOLR-11025) OverseerTest.testShardLeaderChange() failures

2017-07-07 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16078440#comment-16078440
 ] 

Scott Blum commented on SOLR-11025:
---

Will committing to master be a sufficient test?  Or does it need to go on 7x 
and 7_0 immediately?

> OverseerTest.testShardLeaderChange() failures
> -
>
> Key: SOLR-11025
> URL: https://issues.apache.org/jira/browse/SOLR-11025
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Steve Rowe
> Attachments: SOLR-11025.patch
>
>
> Non-reproducing Jenkins failure - this test hasn't failed in Jenkins in 
> months, and suddenly several failures within days of each other:
> [https://jenkins.thetaphi.de/job/Lucene-Solr-7.x-Linux/18/]:
> {noformat}
> Checking out Revision 986175915927ee2bbd971340f858601c86b3c676 
> (refs/remotes/origin/branch_7x)
> [...]
>[junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=OverseerTest 
> -Dtests.method=testShardLeaderChange -Dtests.seed=995C82D4739EF7D8 
> -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=ru 
> -Dtests.timezone=Asia/Calcutta -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8
>[junit4] FAILURE  223s J2 | OverseerTest.testShardLeaderChange <<<
>[junit4]> Throwable #1: java.lang.AssertionError: Unexpected shard 
> leader coll:collection1 shard:shard1 expected: but was:
>[junit4]>  at 
> __randomizedtesting.SeedInfo.seed([995C82D4739EF7D8:470F052369060229]:0)
>[junit4]>  at 
> org.apache.solr.cloud.OverseerTest.verifyShardLeader(OverseerTest.java:486)
>[junit4]>  at 
> org.apache.solr.cloud.OverseerTest.testShardLeaderChange(OverseerTest.java:720)
> [...]
>[junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): {}, 
> docValues:{}, maxPointsInLeafNode=892, maxMBSortInHeap=5.965872045375053, 
> sim=RandomSimilarity(queryNorm=false): {}, locale=ru, timezone=Asia/Calcutta
>[junit4]   2> NOTE: Linux 4.10.0-21-generic i386/Oracle Corporation 
> 1.8.0_131 (32-bit)/cpus=8,threads=1,free=96275160,total=293539840
>[junit4]   2> NOTE: All tests run in this JVM: [HardAutoCommitTest, 
> SimplePostToolTest, TestSolrCoreSnapshots, ReplicationFactorTest, 
> TestSolrCoreProperties, TestSha256AuthenticationProvider, 
> TestExpandComponent, TestCloudRecovery, ConfigureRecoveryStrategyTest, 
> TimeZoneUtilsTest, TestTolerantUpdateProcessorRandomCloud, 
> TestExceedMaxTermLength, PrimitiveFieldTypeTest, SolrInfoBeanTest, 
> FullSolrCloudDistribCmdsTest, UniqFieldsUpdateProcessorFactoryTest, 
> TestReplicaProperties, SolrCloudReportersTest, UnloadDistributedZkTest, 
> CleanupOldIndexTest, LeaderInitiatedRecoveryOnShardRestartTest, 
> FastVectorHighlighterTest, TestOrdValues, MinimalSchemaTest, 
> TestSubQueryTransformerCrossCore, ParsingFieldUpdateProcessorsTest, 
> BufferStoreTest, TestLegacyFieldReuse, CollectionsAPISolrJTest, 
> TestSchemalessBufferedUpdates, ConnectionManagerTest, MetricsConfigTest, 
> TestPayloadScoreQParserPlugin, TermVectorComponentDistributedTest, 
> TestLMJelinekMercerSimilarityFactory, TestFastWriter, MultiTermTest, 
> HdfsBasicDistributedZk2Test, V2StandaloneTest, DocumentBuilderTest, 
> TestMultiValuedNumericRangeQuery, AnalyticsMergeStrategyTest, 
> TestNumericTokenStream, TestFieldCacheSort, 
> TestSolrCloudWithSecureImpersonation, TestBlendedInfixSuggestions, 
> ResponseLogComponentTest, CopyFieldTest, TestAuthenticationFramework, 
> BlockJoinFacetDistribTest, TestFieldSortValues, TestJmxIntegration, 
> SolrTestCaseJ4Test, BasicAuthIntegrationTest, URLClassifyProcessorTest, 
> DateFieldTest, TestExactSharedStatsCache, TestFieldTypeCollectionResource, 
> ExplicitHLLTest, ConjunctionSolrSpellCheckerTest, 
> TestLeaderElectionWithEmptyReplica, TestReloadAndDeleteDocs, 
> ClusterStateTest, TestSQLHandler, HdfsRecoverLeaseTest, QueryEqualityTest, 
> UUIDUpdateProcessorFallbackTest, ClassificationUpdateProcessorFactoryTest, 
> DistributedSuggestComponentTest, TestHalfAndHalfDocValues, 
> ShowFileRequestHandlerTest, ExitableDirectoryReaderTest, 
> TestInfoStreamLogging, TestLocalFSCloudBackupRestore, 
> ChaosMonkeyNothingIsSafeWithPullReplicasTest, TestSort, NumericFieldsTest, 
> DirectUpdateHandlerTest, SuggesterFSTTest, NodeMutatorTest, 
> DateMathParserTest, DistribCursorPagingTest, CircularListTest, 
> CloneFieldUpdateProcessorFactoryTest, 
> OverseerCollectionConfigSetProcessorTest, DateRangeFieldTest, 
> TestDFRSimilarityFactory, XsltUpdateRequestHandlerTest, TestConfigSetsAPI, 
> CoreMergeIndexesAdminHandlerTest, TestSerializedLuceneMatchVersion, 
> TestPrepRecovery, TestNoOpRegenerator, DistributedFacetPivotWhiteBoxTest, 
> LeaderFailoverAfterPartitionTest, TestSolrFieldCacheBean, OverseerTest]
> {noformat}
> Following is the first of 4 failures from my Jenkins in the 

[jira] [Updated] (SOLR-11025) OverseerTest.testShardLeaderChange() failures

2017-07-07 Thread Scott Blum (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Blum updated SOLR-11025:
--
Attachment: SOLR-11025.patch

Any way to test if this fixes the failure before we commit it to 3 branches?
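
One cheap way to gain some confidence locally (assuming the randomizedtesting 
{{@Repeat}} annotation that the Lucene/Solr test framework uses) is to beast the 
method a bunch of times before committing; toy example:

{code:java}
import com.carrotsearch.randomizedtesting.RandomizedRunner;
import com.carrotsearch.randomizedtesting.annotations.Repeat;
import org.junit.Test;
import org.junit.runner.RunWith;

// Toy example; in practice you'd put @Repeat on the real
// OverseerTest#testShardLeaderChange method and drop it before committing.
@RunWith(RandomizedRunner.class)
public class BeastingExample {

  @Test
  @Repeat(iterations = 100)
  public void possiblyFlakyTest() {
    // the real test body would go here
  }
}
{code}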

> OverseerTest.testShardLeaderChange() failures
> -
>
> Key: SOLR-11025
> URL: https://issues.apache.org/jira/browse/SOLR-11025
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Steve Rowe
> Attachments: SOLR-11025.patch
>
>
> Non-reproducing Jenkins failure - this test hasn't failed in Jenkins in 
> months, and suddenly several failures within days of each other:
> [https://jenkins.thetaphi.de/job/Lucene-Solr-7.x-Linux/18/]:
> {noformat}
> Checking out Revision 986175915927ee2bbd971340f858601c86b3c676 
> (refs/remotes/origin/branch_7x)
> [...]
>[junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=OverseerTest 
> -Dtests.method=testShardLeaderChange -Dtests.seed=995C82D4739EF7D8 
> -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=ru 
> -Dtests.timezone=Asia/Calcutta -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8
>[junit4] FAILURE  223s J2 | OverseerTest.testShardLeaderChange <<<
>[junit4]> Throwable #1: java.lang.AssertionError: Unexpected shard 
> leader coll:collection1 shard:shard1 expected: but was:
>[junit4]>  at 
> __randomizedtesting.SeedInfo.seed([995C82D4739EF7D8:470F052369060229]:0)
>[junit4]>  at 
> org.apache.solr.cloud.OverseerTest.verifyShardLeader(OverseerTest.java:486)
>[junit4]>  at 
> org.apache.solr.cloud.OverseerTest.testShardLeaderChange(OverseerTest.java:720)
> [...]
>[junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): {}, 
> docValues:{}, maxPointsInLeafNode=892, maxMBSortInHeap=5.965872045375053, 
> sim=RandomSimilarity(queryNorm=false): {}, locale=ru, timezone=Asia/Calcutta
>[junit4]   2> NOTE: Linux 4.10.0-21-generic i386/Oracle Corporation 
> 1.8.0_131 (32-bit)/cpus=8,threads=1,free=96275160,total=293539840
>[junit4]   2> NOTE: All tests run in this JVM: [HardAutoCommitTest, 
> SimplePostToolTest, TestSolrCoreSnapshots, ReplicationFactorTest, 
> TestSolrCoreProperties, TestSha256AuthenticationProvider, 
> TestExpandComponent, TestCloudRecovery, ConfigureRecoveryStrategyTest, 
> TimeZoneUtilsTest, TestTolerantUpdateProcessorRandomCloud, 
> TestExceedMaxTermLength, PrimitiveFieldTypeTest, SolrInfoBeanTest, 
> FullSolrCloudDistribCmdsTest, UniqFieldsUpdateProcessorFactoryTest, 
> TestReplicaProperties, SolrCloudReportersTest, UnloadDistributedZkTest, 
> CleanupOldIndexTest, LeaderInitiatedRecoveryOnShardRestartTest, 
> FastVectorHighlighterTest, TestOrdValues, MinimalSchemaTest, 
> TestSubQueryTransformerCrossCore, ParsingFieldUpdateProcessorsTest, 
> BufferStoreTest, TestLegacyFieldReuse, CollectionsAPISolrJTest, 
> TestSchemalessBufferedUpdates, ConnectionManagerTest, MetricsConfigTest, 
> TestPayloadScoreQParserPlugin, TermVectorComponentDistributedTest, 
> TestLMJelinekMercerSimilarityFactory, TestFastWriter, MultiTermTest, 
> HdfsBasicDistributedZk2Test, V2StandaloneTest, DocumentBuilderTest, 
> TestMultiValuedNumericRangeQuery, AnalyticsMergeStrategyTest, 
> TestNumericTokenStream, TestFieldCacheSort, 
> TestSolrCloudWithSecureImpersonation, TestBlendedInfixSuggestions, 
> ResponseLogComponentTest, CopyFieldTest, TestAuthenticationFramework, 
> BlockJoinFacetDistribTest, TestFieldSortValues, TestJmxIntegration, 
> SolrTestCaseJ4Test, BasicAuthIntegrationTest, URLClassifyProcessorTest, 
> DateFieldTest, TestExactSharedStatsCache, TestFieldTypeCollectionResource, 
> ExplicitHLLTest, ConjunctionSolrSpellCheckerTest, 
> TestLeaderElectionWithEmptyReplica, TestReloadAndDeleteDocs, 
> ClusterStateTest, TestSQLHandler, HdfsRecoverLeaseTest, QueryEqualityTest, 
> UUIDUpdateProcessorFallbackTest, ClassificationUpdateProcessorFactoryTest, 
> DistributedSuggestComponentTest, TestHalfAndHalfDocValues, 
> ShowFileRequestHandlerTest, ExitableDirectoryReaderTest, 
> TestInfoStreamLogging, TestLocalFSCloudBackupRestore, 
> ChaosMonkeyNothingIsSafeWithPullReplicasTest, TestSort, NumericFieldsTest, 
> DirectUpdateHandlerTest, SuggesterFSTTest, NodeMutatorTest, 
> DateMathParserTest, DistribCursorPagingTest, CircularListTest, 
> CloneFieldUpdateProcessorFactoryTest, 
> OverseerCollectionConfigSetProcessorTest, DateRangeFieldTest, 
> TestDFRSimilarityFactory, XsltUpdateRequestHandlerTest, TestConfigSetsAPI, 
> CoreMergeIndexesAdminHandlerTest, TestSerializedLuceneMatchVersion, 
> TestPrepRecovery, TestNoOpRegenerator, DistributedFacetPivotWhiteBoxTest, 
> LeaderFailoverAfterPartitionTest, TestSolrFieldCacheBean, OverseerTest]
> {noformat}
> Following is the first of 4 failures from my Jenkins in the last 24 hours or 
> so on 

[jira] [Commented] (SOLR-11025) OverseerTest.testShardLeaderChange() failures

2017-07-07 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16077749#comment-16077749
 ] 

Scott Blum commented on SOLR-11025:
---

Possible issue: the order of operations in SOLR-10983 is to first remove the 
item from the queue, then add it to queue-work, in order to avoid duplicating an 
item.  But perhaps there's a small chance of losing an update?  Would the other 
order be better?
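
To make the trade-off concrete, here's a sketch of the two orderings (paths and 
method names are assumptions, just to show where each failure window sits):

{code:java}
import static org.apache.zookeeper.ZooDefs.Ids.OPEN_ACL_UNSAFE;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooKeeper;

/** Sketch only: illustrates the two orderings under discussion, not actual Solr code. */
class MoveToWorkQueue {
  private final ZooKeeper zk;

  MoveToWorkQueue(ZooKeeper zk) {
    this.zk = zk;
  }

  // Ordering A (what SOLR-10983 does): remove first, then copy into queue-work.
  // A crash between the two calls loses the update (it exists nowhere).
  void removeThenCopy(String id) throws Exception {
    byte[] data = zk.getData("/overseer/queue/" + id, false, null);
    zk.delete("/overseer/queue/" + id, -1);
    // <-- crash here and the update is gone
    zk.create("/overseer/queue-work/" + id, data, OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
  }

  // Ordering B: copy into queue-work first, then remove from the queue.
  // A crash between the two calls leaves the update in both places, so it can be
  // processed twice, which is exactly the duplication SOLR-10983 was avoiding.
  void copyThenRemove(String id) throws Exception {
    byte[] data = zk.getData("/overseer/queue/" + id, false, null);
    zk.create("/overseer/queue-work/" + id, data, OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    // <-- crash here and the update is duplicated
    zk.delete("/overseer/queue/" + id, -1);
  }
}
{code}

So it reads like a choice between possibly losing an update and possibly 
processing one twice, unless the move is done atomically (ZK's multi() could do 
both operations in one transaction).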

> OverseerTest.testShardLeaderChange() failures
> -
>
> Key: SOLR-11025
> URL: https://issues.apache.org/jira/browse/SOLR-11025
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Steve Rowe
>
> Non-reproducing Jenkins failure - this test hasn't failed in Jenkins in 
> months, and suddenly several failures within days of each other:
> [https://jenkins.thetaphi.de/job/Lucene-Solr-7.x-Linux/18/]:
> {noformat}
> Checking out Revision 986175915927ee2bbd971340f858601c86b3c676 
> (refs/remotes/origin/branch_7x)
> [...]
>[junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=OverseerTest 
> -Dtests.method=testShardLeaderChange -Dtests.seed=995C82D4739EF7D8 
> -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=ru 
> -Dtests.timezone=Asia/Calcutta -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8
>[junit4] FAILURE  223s J2 | OverseerTest.testShardLeaderChange <<<
>[junit4]> Throwable #1: java.lang.AssertionError: Unexpected shard 
> leader coll:collection1 shard:shard1 expected: but was:
>[junit4]>  at 
> __randomizedtesting.SeedInfo.seed([995C82D4739EF7D8:470F052369060229]:0)
>[junit4]>  at 
> org.apache.solr.cloud.OverseerTest.verifyShardLeader(OverseerTest.java:486)
>[junit4]>  at 
> org.apache.solr.cloud.OverseerTest.testShardLeaderChange(OverseerTest.java:720)
> [...]
>[junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): {}, 
> docValues:{}, maxPointsInLeafNode=892, maxMBSortInHeap=5.965872045375053, 
> sim=RandomSimilarity(queryNorm=false): {}, locale=ru, timezone=Asia/Calcutta
>[junit4]   2> NOTE: Linux 4.10.0-21-generic i386/Oracle Corporation 
> 1.8.0_131 (32-bit)/cpus=8,threads=1,free=96275160,total=293539840
>[junit4]   2> NOTE: All tests run in this JVM: [HardAutoCommitTest, 
> SimplePostToolTest, TestSolrCoreSnapshots, ReplicationFactorTest, 
> TestSolrCoreProperties, TestSha256AuthenticationProvider, 
> TestExpandComponent, TestCloudRecovery, ConfigureRecoveryStrategyTest, 
> TimeZoneUtilsTest, TestTolerantUpdateProcessorRandomCloud, 
> TestExceedMaxTermLength, PrimitiveFieldTypeTest, SolrInfoBeanTest, 
> FullSolrCloudDistribCmdsTest, UniqFieldsUpdateProcessorFactoryTest, 
> TestReplicaProperties, SolrCloudReportersTest, UnloadDistributedZkTest, 
> CleanupOldIndexTest, LeaderInitiatedRecoveryOnShardRestartTest, 
> FastVectorHighlighterTest, TestOrdValues, MinimalSchemaTest, 
> TestSubQueryTransformerCrossCore, ParsingFieldUpdateProcessorsTest, 
> BufferStoreTest, TestLegacyFieldReuse, CollectionsAPISolrJTest, 
> TestSchemalessBufferedUpdates, ConnectionManagerTest, MetricsConfigTest, 
> TestPayloadScoreQParserPlugin, TermVectorComponentDistributedTest, 
> TestLMJelinekMercerSimilarityFactory, TestFastWriter, MultiTermTest, 
> HdfsBasicDistributedZk2Test, V2StandaloneTest, DocumentBuilderTest, 
> TestMultiValuedNumericRangeQuery, AnalyticsMergeStrategyTest, 
> TestNumericTokenStream, TestFieldCacheSort, 
> TestSolrCloudWithSecureImpersonation, TestBlendedInfixSuggestions, 
> ResponseLogComponentTest, CopyFieldTest, TestAuthenticationFramework, 
> BlockJoinFacetDistribTest, TestFieldSortValues, TestJmxIntegration, 
> SolrTestCaseJ4Test, BasicAuthIntegrationTest, URLClassifyProcessorTest, 
> DateFieldTest, TestExactSharedStatsCache, TestFieldTypeCollectionResource, 
> ExplicitHLLTest, ConjunctionSolrSpellCheckerTest, 
> TestLeaderElectionWithEmptyReplica, TestReloadAndDeleteDocs, 
> ClusterStateTest, TestSQLHandler, HdfsRecoverLeaseTest, QueryEqualityTest, 
> UUIDUpdateProcessorFallbackTest, ClassificationUpdateProcessorFactoryTest, 
> DistributedSuggestComponentTest, TestHalfAndHalfDocValues, 
> ShowFileRequestHandlerTest, ExitableDirectoryReaderTest, 
> TestInfoStreamLogging, TestLocalFSCloudBackupRestore, 
> ChaosMonkeyNothingIsSafeWithPullReplicasTest, TestSort, NumericFieldsTest, 
> DirectUpdateHandlerTest, SuggesterFSTTest, NodeMutatorTest, 
> DateMathParserTest, DistribCursorPagingTest, CircularListTest, 
> CloneFieldUpdateProcessorFactoryTest, 
> OverseerCollectionConfigSetProcessorTest, DateRangeFieldTest, 
> TestDFRSimilarityFactory, XsltUpdateRequestHandlerTest, TestConfigSetsAPI, 
> CoreMergeIndexesAdminHandlerTest, TestSerializedLuceneMatchVersion, 
> TestPrepRecovery, TestNoOpRegenerator, DistributedFacetPivotWhiteBoxTest, 
> LeaderFailoverAfterPartitionTest, 

[jira] [Commented] (SOLR-10983) Fix DOWNNODE -> queue-work explosion

2017-07-05 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16075666#comment-16075666
 ] 

Scott Blum commented on SOLR-10983:
---

BTW: this issue most likely affects all 6.x releases (and even some late 5.x), 
so it should be considered if we do any 6.x point releases later.

> Fix DOWNNODE -> queue-work explosion
> 
>
> Key: SOLR-10983
> URL: https://issues.apache.org/jira/browse/SOLR-10983
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Scott Blum
>Assignee: Scott Blum
> Fix For: 7.0, master (8.0), 7.1
>
> Attachments: SOLR-10983.patch
>
>
> Every DOWNNODE command enqueues N copies of itself into queue-work, where N 
> is the number of collections affected by the DOWNNODE.
> This rarely matters in practice, because queue-work gets immediately dumped-- 
> however, if anything throws an exception (such as ZK bad version), we don't 
> clear queue-work.  Then the next time through the loop we run the expensive 
> DOWNNODE command potentially hundreds of times.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (SOLR-10983) Fix DOWNNODE -> queue-work explosion

2017-07-05 Thread Scott Blum (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Blum resolved SOLR-10983.
---
   Resolution: Fixed
Fix Version/s: 7.1
   master (8.0)
   7.0

> Fix DOWNNODE -> queue-work explosion
> 
>
> Key: SOLR-10983
> URL: https://issues.apache.org/jira/browse/SOLR-10983
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Scott Blum
>Assignee: Scott Blum
> Fix For: 7.0, master (8.0), 7.1
>
> Attachments: SOLR-10983.patch
>
>
> Every DOWNNODE command enqueues N copies of itself into queue-work, where N 
> is the number of collections affected by the DOWNNODE.
> This rarely matters in practice, because queue-work gets immediately dumped-- 
> however, if anything throws an exception (such as ZK bad version), we don't 
> clear queue-work.  Then the next time through the loop we run the expensive 
> DOWNNODE command potentially hundreds of times.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10983) Fix DOWNNODE -> queue-work explosion

2017-07-04 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16074225#comment-16074225
 ] 

Scott Blum commented on SOLR-10983:
---

Thanks!  Will do

> Fix DOWNNODE -> queue-work explosion
> 
>
> Key: SOLR-10983
> URL: https://issues.apache.org/jira/browse/SOLR-10983
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Scott Blum
>Assignee: Scott Blum
> Attachments: SOLR-10983.patch
>
>
> Every DOWNNODE command enqueues N copies of itself into queue-work, where N 
> is the number of collections affected by the DOWNNODE.
> This rarely matters in practice, because queue-work gets immediately dumped-- 
> however, if anything throws an exception (such as ZK bad version), we don't 
> clear queue-work.  Then the next time through the loop we run the expensive 
> DOWNNODE command potentially hundreds of times.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10276) Update ZK leader election so that leader notices if its leadership is revoked

2017-06-30 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070340#comment-16070340
 ] 

Scott Blum commented on SOLR-10276:
---

Please add a patch!  This is super useful operationally.

> Update ZK leader election so that leader notices if its leadership is revoked
> -
>
> Key: SOLR-10276
> URL: https://issues.apache.org/jira/browse/SOLR-10276
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 5.5.3
>Reporter: Joshua Humphries
>Assignee: Scott Blum
>Priority: Minor
>  Labels: leader, zookeeper
>
> When we have an issue with a Solr node, it would be nice to revoke its 
> leadership of one or more shards, or to revoke its role as overseer, without 
> actually restarting the node. (Restarting the node tends to spam the overseer 
> queue since we have a very large number of cores per node.)
> Operationally, it would be nice if one could just delete the leader's 
> election node (e.g. its ephemeral sequential node that indicates it as 
> current leader) and to have it notice the change and stop behaving as leader.
> Currently, once a node becomes leader, it isn't watching ZK for any changes 
> that could revoke its leadership. I am proposing that, upon being elected 
> leader, it use a ZK watch to monitor its own election node. If its own 
> election node is deleted, it then relinquishes leadership (e.g. calls 
> ElectionContext#cancelElection() and then re-joins the election).
> I have a patch with tests that I can contribute.
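
For reference, the mechanism being proposed is roughly the following (a sketch 
with assumed names, not the contributed patch):

{code:java}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

/** Sketch only: a leader watching its own election node so it notices revocation. */
class LeaderSelfWatch implements Watcher {
  private final ZooKeeper zk;
  private final String myElectionNode; // the ephemeral sequential node we own
  private final Runnable relinquish;   // e.g. cancelElection() + re-join the election

  LeaderSelfWatch(ZooKeeper zk, String myElectionNode, Runnable relinquish) {
    this.zk = zk;
    this.myElectionNode = myElectionNode;
    this.relinquish = relinquish;
  }

  /** Call once after winning the election. */
  void arm() throws Exception {
    if (zk.exists(myElectionNode, this) == null) {
      relinquish.run(); // already deleted before we could arm the watch
    }
  }

  @Override
  public void process(WatchedEvent event) {
    if (event.getType() == Event.EventType.NodeDeleted) {
      relinquish.run(); // our election node was deleted: stop acting as leader
    } else {
      try {
        arm(); // ZK watches are one-shot, so re-arm on any other event
      } catch (Exception e) {
        relinquish.run(); // if we can't re-arm, err on the side of giving up leadership
      }
    }
  }
}
{code}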



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Assigned] (SOLR-10276) Update ZK leader election so that leader notices if its leadership is revoked

2017-06-30 Thread Scott Blum (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Blum reassigned SOLR-10276:
-

Assignee: Scott Blum

> Update ZK leader election so that leader notices if its leadership is revoked
> -
>
> Key: SOLR-10276
> URL: https://issues.apache.org/jira/browse/SOLR-10276
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 5.5.3
>Reporter: Joshua Humphries
>Assignee: Scott Blum
>Priority: Minor
>  Labels: leader, zookeeper
>
> When we have an issue with a Solr node, it would be nice to revoke its 
> leadership of one or more shards, or to revoke its role as overseer, without 
> actually restarting the node. (Restarting the node tends to spam the overseer 
> queue since we have a very large number of cores per node.)
> Operationally, it would be nice if one could just delete the leader's 
> election node (e.g. its ephemeral sequential node that indicates it as 
> current leader) and to have it notice the change and stop behaving as leader.
> Currently, once a node becomes leader, it isn't watching ZK for any changes 
> that could revoke its leadership. I am proposing that, upon being elected 
> leader, it use a ZK watch to monitor its own election node. If its own 
> election node is deleted, it then relinquishes leadership (e.g. calls 
> ElectionContext#cancelElection() and then re-joins the election).
> I have a patch with tests that I can contribute.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10983) Fix DOWNNODE -> queue-work explosion

2017-06-29 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069503#comment-16069503
 ] 

Scott Blum commented on SOLR-10983:
---

[~shalinmangar] [~jhump]

> Fix DOWNNODE -> queue-work explosion
> 
>
> Key: SOLR-10983
> URL: https://issues.apache.org/jira/browse/SOLR-10983
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Scott Blum
>Assignee: Scott Blum
> Attachments: SOLR-10983.patch
>
>
> Every DOWNNODE command enqueues N copies of itself into queue-work, where N 
> is the number of collections affected by the DOWNNODE.
> This rarely matters in practice, because queue-work gets immediately dumped-- 
> however, if anything throws an exception (such as ZK bad version), we don't 
> clear queue-work.  Then the next time through the loop we run the expensive 
> DOWNNODE command potentially hundreds of times.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-10983) Fix DOWNNODE -> queue-work explosion

2017-06-29 Thread Scott Blum (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Blum updated SOLR-10983:
--
Attachment: SOLR-10983.patch

> Fix DOWNNODE -> queue-work explosion
> 
>
> Key: SOLR-10983
> URL: https://issues.apache.org/jira/browse/SOLR-10983
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Scott Blum
>Assignee: Scott Blum
> Attachments: SOLR-10983.patch
>
>
> Every DOWNNODE command enqueues N copies of itself into queue-work, where N 
> is the number of collections affected by the DOWNNODE.
> This rarely matters in practice, because queue-work gets immediately dumped-- 
> however, if anything throws an exception (such as ZK bad version), we don't 
> clear queue-work.  Then the next time through the loop we run the expensive 
> DOWNNODE command potentially hundreds of times.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-10983) Fix DOWNNODE -> queue-work explosion

2017-06-29 Thread Scott Blum (JIRA)
Scott Blum created SOLR-10983:
-

 Summary: Fix DOWNNODE -> queue-work explosion
 Key: SOLR-10983
 URL: https://issues.apache.org/jira/browse/SOLR-10983
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: SolrCloud
Reporter: Scott Blum
Assignee: Scott Blum


Every DOWNNODE command enqueues N copies of itself into queue-work, where N is 
the number of collections affected by the DOWNNODE.

This rarely matters in practice, because queue-work gets immediately dumped-- 
however, if anything throws an exception (such as ZK bad version), we don't 
clear queue-work.  Then the next time through the loop we run the expensive 
DOWNNODE command potentially hundreds of times.
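
For illustration, the defensive pattern is roughly the following; the queue and 
helper names here are hypothetical, not the actual Overseer code:

{code}
// Hypothetical sketch: queueWork, readMessages() and processAll() are illustrative names.
Deque<Object> queueWork = new ArrayDeque<>();
while (running) {
  queueWork.addAll(readMessages());   // a single DOWNNODE may add N entries
  try {
    processAll(queueWork);            // may throw, e.g. on a ZK bad version
  } finally {
    queueWork.clear();                // clear even on failure so the next pass
                                      // does not replay the expensive commands
  }
}
{code}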



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-10619) Optimize using cache for DQ.peek(), DQ.poll() in case of single-consumer

2017-05-08 Thread Scott Blum (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Blum updated SOLR-10619:
--
Attachment: SOLR-10619-dragonsinth.patch

What do you think about this?

> Optimize using cache for DQ.peek(), DQ.poll() in case of single-consumer
> 
>
> Key: SOLR-10619
> URL: https://issues.apache.org/jira/browse/SOLR-10619
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
> Attachments: SOLR-10619-dragonsinth.patch, SOLR-10619.patch, 
> SOLR-10619.patch, SOLR-10619.patch, SOLR-10619.patch
>
>
> Right now, every time childWatcher is kicked off, we refetch all children 
> of DQ's node. This is wasteful in the case of a single consumer.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10619) Optimize using cache for DQ.peek(), DQ.poll() in case of single-consumer

2017-05-08 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002053#comment-16002053
 ] 

Scott Blum commented on SOLR-10619:
---

[~caomanhdat] Okay, I see what's going on here.  Your change looks good.  
Although... I think maybe we should change peekElements so that it only forces 
a refresh if it's going to have to block?
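
A rough sketch of that idea, using assumed names and a simplified signature 
rather than the real peekElements():

{code}
// Sketch only: knownChildren (a sorted set of child names), updateLock, the
// 'changed' condition and fetchZkChildren() are assumed surrounding members.
public List<String> peekElements(int max, long waitMillis) throws Exception {
  long waitNanos = TimeUnit.MILLISECONDS.toNanos(waitMillis);
  updateLock.lockInterruptibly();
  try {
    while (true) {
      if (!knownChildren.isEmpty()) {
        // Serve the cached view; no ZK round trip needed.
        return knownChildren.stream().limit(max).collect(Collectors.toList());
      }
      // The cached view is empty, so a refresh is unavoidable before blocking.
      knownChildren.addAll(fetchZkChildren());
      if (!knownChildren.isEmpty()) {
        continue;
      }
      if (waitNanos <= 0) {
        return Collections.emptyList();
      }
      waitNanos = changed.awaitNanos(waitNanos);
    }
  } finally {
    updateLock.unlock();
  }
}
{code}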

> Optimize using cache for DQ.peek(), DQ.poll() in case of single-consumer
> 
>
> Key: SOLR-10619
> URL: https://issues.apache.org/jira/browse/SOLR-10619
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
> Attachments: SOLR-10619.patch, SOLR-10619.patch, SOLR-10619.patch, 
> SOLR-10619.patch
>
>
> Right now, every time childWatcher is kicked off, we refetch all children 
> of DQ's node. This is wasteful in the case of a single consumer.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10619) Optimize using cache for DQ.peek(), DQ.poll() in case of single-consumer

2017-05-08 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002013#comment-16002013
 ] 

Scott Blum commented on SOLR-10619:
---

Hmm lemme take a look.  Maybe that test should just have different expectations?

> Optimize using cache for DQ.peek(), DQ.poll() in case of single-consumer
> 
>
> Key: SOLR-10619
> URL: https://issues.apache.org/jira/browse/SOLR-10619
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
> Attachments: SOLR-10619.patch, SOLR-10619.patch, SOLR-10619.patch, 
> SOLR-10619.patch
>
>
> Right now, every time childWatcher is kicked off, we refetch all children 
> of DQ's node. This is wasteful in the case of a single consumer.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10619) When DQ.knowChildren is not empty, DQ should not refetch node children in case of single consumer

2017-05-08 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001872#comment-16001872
 ] 

Scott Blum commented on SOLR-10619:
---

[~caomanhdat] patch LGTM.  I would update the main class comment on 
DistributedQueue to read:

"A distributed queue.  Optimized for single-consumer, multiple-producer: if 
there are multiple consumers on the same ZK queue, the results should be 
correct but inefficient."

And then where you call "knownChildren.clear();" you could add a comment 
"Efficient only for single-consumer"

> When DQ.knowChildren is not empty, DQ should not refetch node children in 
> case of single consumer
> -
>
> Key: SOLR-10619
> URL: https://issues.apache.org/jira/browse/SOLR-10619
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
> Attachments: SOLR-10619.patch, SOLR-10619.patch, SOLR-10619.patch
>
>
> Right now, every time childWatcher is kicked off, we refetch all children 
> of DQ's node. This is wasteful in the case of a single consumer.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10524) Better ZkStateWriter batching

2017-05-08 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001856#comment-16001856
 ] 

Scott Blum commented on SOLR-10524:
---

Thanks [~caomanhdat], committed to master.

> Better ZkStateWriter batching
> -
>
> Key: SOLR-10524
> URL: https://issues.apache.org/jira/browse/SOLR-10524
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Erick Erickson
> Attachments: SOLR-10524-dragonsinth.patch, SOLR-10524-NPE-fix.patch, 
> SOLR-10524.patch, SOLR-10524.patch, SOLR-10524.patch, SOLR-10524.patch
>
>
> There are several JIRAs (I'll link in a second) about trying to be more 
> efficient about processing overseer messages as the overseer can become a 
> bottleneck, especially with very large numbers of replicas in a cluster. One 
> of the approaches mentioned near the end of SOLR-5872 (15-Mar) was to "read 
> large no:of items say 1. put them into in memory buckets and feed them 
> into overseer".
> This JIRA is to break out that part of the discussion as it might be an easy 
> win whereas "eliminating the Overseer queue" would be quite an undertaking.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-10524) Better ZkStateWriter batching

2017-05-08 Thread Scott Blum (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Blum updated SOLR-10524:
--
Summary: Better ZkStateWriter batching  (was: Explore in-memory 
partitioning for processing Overseer queue messages)

> Better ZkStateWriter batching
> -
>
> Key: SOLR-10524
> URL: https://issues.apache.org/jira/browse/SOLR-10524
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Erick Erickson
> Attachments: SOLR-10524-dragonsinth.patch, SOLR-10524-NPE-fix.patch, 
> SOLR-10524.patch, SOLR-10524.patch, SOLR-10524.patch, SOLR-10524.patch
>
>
> There are several JIRAs (I'll link in a second) about trying to be more 
> efficient about processing overseer messages as the overseer can become a 
> bottleneck, especially with very large numbers of replicas in a cluster. One 
> of the approaches mentioned near the end of SOLR-5872 (15-Mar) was to "read 
> large no:of items say 1. put them into in memory buckets and feed them 
> into overseer".
> This JIRA is to break out that part of the discussion as it might be an easy 
> win whereas "eliminating the Overseer queue" would be quite an undertaking.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-10524) Explore in-memory partitioning for processing Overseer queue messages

2017-05-08 Thread Scott Blum (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Blum updated SOLR-10524:
--
Attachment: SOLR-10524-dragonsinth.patch

Fixes ZkStateWriterTest, etc

> Explore in-memory partitioning for processing Overseer queue messages
> -
>
> Key: SOLR-10524
> URL: https://issues.apache.org/jira/browse/SOLR-10524
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Erick Erickson
> Attachments: SOLR-10524-dragonsinth.patch, SOLR-10524-NPE-fix.patch, 
> SOLR-10524.patch, SOLR-10524.patch, SOLR-10524.patch, SOLR-10524.patch
>
>
> There are several JIRAs (I'll link in a second) about trying to be more 
> efficient about processing overseer messages as the overseer can become a 
> bottleneck, especially with very large numbers of replicas in a cluster. One 
> of the approaches mentioned near the end of SOLR-5872 (15-Mar) was to "read 
> large no:of items say 1. put them into in memory buckets and feed them 
> into overseer".
> This JIRA is to break out that part of the discussion as it might be an easy 
> win whereas "eliminating the Overseer queue" would be quite an undertaking.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10524) Explore in-memory partitioning for processing Overseer queue messages

2017-05-08 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001810#comment-16001810
 ] 

Scott Blum commented on SOLR-10524:
---

ZkStateWriterTests.testZkStateWriterBatching() is written for exactly the 
behavior we wanted to change here. :)  That test needs an overhaul.  Patch 
forthcoming.

> Explore in-memory partitioning for processing Overseer queue messages
> -
>
> Key: SOLR-10524
> URL: https://issues.apache.org/jira/browse/SOLR-10524
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Erick Erickson
> Attachments: SOLR-10524-NPE-fix.patch, SOLR-10524.patch, 
> SOLR-10524.patch, SOLR-10524.patch, SOLR-10524.patch
>
>
> There are several JIRAs (I'll link in a second) about trying to be more 
> efficient about processing overseer messages as the overseer can become a 
> bottleneck, especially with very large numbers of replicas in a cluster. One 
> of the approaches mentioned near the end of SOLR-5872 (15-Mar) was to "read 
> large no:of items say 1. put them into in memory buckets and feed them 
> into overseer".
> This JIRA is to break out that part of the discussion as it might be an easy 
> win whereas "eliminating the Overseer queue" would be quite an undertaking.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10619) When DQ.knowChildren is not empty, DQ should not refetch node children in case of single consumer

2017-05-07 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16000304#comment-16000304
 ] 

Scott Blum commented on SOLR-10619:
---

I'll take another look tomorrow.

> When DQ.knowChildren is not empty, DQ should not refetch node children in 
> case of single consumer
> -
>
> Key: SOLR-10619
> URL: https://issues.apache.org/jira/browse/SOLR-10619
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
> Attachments: SOLR-10619.patch, SOLR-10619.patch
>
>
> Right now, every time childWatcher is kicked off, we refetch all children 
> of DQ's node. This is wasteful in the case of a single consumer.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10285) Reduce state messages when there are leader only shards

2017-05-07 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16000303#comment-16000303
 ] 

Scott Blum commented on SOLR-10285:
---

[~jhump] did you have a patch for this?  or did we only discuss it?

> Reduce state messages when there are leader only shards
> ---
>
> Key: SOLR-10285
> URL: https://issues.apache.org/jira/browse/SOLR-10285
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Varun Thacker
>Assignee: Cao Manh Dat
>
> For shards which have 1 replica ( leader ) we know it doesn't need to recover 
> from anyone. We should short-circuit the recovery process in this case. 
> The motivation for this being that we will generate less state events and be 
> able to mark these replicas as active again without it needing to go into 
> 'recovering' state. 
> We already short circuit when you set {{-Dsolrcloud.skip.autorecovery=true}} 
> but that sys prop was meant for tests only. Extending this to make sure the 
> code short-circuits when the core knows its the only replica in the shard is 
> the motivation of the Jira.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10619) When DQ.knowChildren is not empty, DQ should not refetch node children in case of single consumer

2017-05-06 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15999680#comment-15999680
 ] 

Scott Blum commented on SOLR-10619:
---

Sounds good!

> When DQ.knowChildren is not empty, DQ should not refetch node children in 
> case of single consumer
> -
>
> Key: SOLR-10619
> URL: https://issues.apache.org/jira/browse/SOLR-10619
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
> Attachments: SOLR-10619.patch, SOLR-10619.patch
>
>
> Right now, every time childWatcher is kicked off, we refetch all children 
> of DQ's node. This is wasteful in the case of a single consumer.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10619) When DQ.knowChildren is not empty, DQ should not refetch node children in case of single consumer

2017-05-06 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15999652#comment-15999652
 ] 

Scott Blum commented on SOLR-10619:
---

Good thought.  I was debating the same thing in my head.  In normal Solr 
operation it would be an exceptional case: since the Overseer is a single 
consumer, it should almost never happen that resolving the head node against 
ZK fails.  But for the sake of generality, I can see both sides:

- If you continue trying to iterate through in-memory children, you might hit a 
really long list of failures (many, many round trips to ZK) before hitting the 
next live child.  Theoretically, re-fetching the child list from ZK would be 
faster to get a fresh view.

- But, in a true multi-consumer use case, re-fetching the child list every time 
you get a "miss" because someone else consumed ahead of you would cause the 
many consumers to thrash a lot.

So I'm a little torn, since option 1 might be slightly better for Overseer in 
particular, but option 2 is clearly more general.  Thoughts?

> When DQ.knowChildren is not empty, DQ should not refetch node children in 
> case of single consumer
> -
>
> Key: SOLR-10619
> URL: https://issues.apache.org/jira/browse/SOLR-10619
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
> Attachments: SOLR-10619.patch, SOLR-10619.patch
>
>
> Right now, every time childWatcher is kicked off, we refetch all children 
> of DQ's node. This is wasteful in the case of a single consumer.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10619) When DQ.knowChildren is not empty, DQ should not refetch node children in case of single consumer

2017-05-06 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15999586#comment-15999586
 ] 

Scott Blum commented on SOLR-10619:
---

I completely follow your logic.  But then, why do we need a singleConsumer 
flag?  Couldn't we just do the optimization in all cases?

In other words, let's say we're dirty but we have children in memory.  Fine, 
let's optimistically fetch the first child's data based on the node name we 
have in memory.  If it succeeds, we're done, just return it.

But if it fails because the node no longer exists, and we're in a dirty state, 
THEN we can refetch the child list from ZK.
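
A rough sketch of that flow, with assumed helper names (firstKnownChild(), 
refetchChildrenFromZk()) rather than the actual DistributedQueue internals:

{code}
// Sketch only: optimistic read of the cached head, falling back to a full ZK
// refetch only when the cached head node turns out to be gone.
private byte[] optimisticFirstElement() throws KeeperException, InterruptedException {
  String firstChild = firstKnownChild();            // from the in-memory view
  if (firstChild != null) {
    try {
      return zookeeper.getData(dir + "/" + firstChild, null, null, true);
    } catch (KeeperException.NoNodeException e) {
      // Someone else consumed it out from under us; fall through and refresh.
    }
  }
  refetchChildrenFromZk();                          // only now pay the ZK round trip
  return null;                                      // caller retries on the fresh view
}
{code}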

> When DQ.knowChildren is not empty, DQ should not refetch node children in 
> case of single consumer
> -
>
> Key: SOLR-10619
> URL: https://issues.apache.org/jira/browse/SOLR-10619
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
> Attachments: SOLR-10619.patch, SOLR-10619.patch
>
>
> Right now, every time childWatcher is kicked off, we refetch all children 
> of DQ's node. This is wasteful in the case of a single consumer.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10619) When DQ.knowChildren is not empty, DQ should not refetch node children in case of single consumer

2017-05-05 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15999299#comment-15999299
 ] 

Scott Blum commented on SOLR-10619:
---

What's the whole "singleConsumer" thing?  Can you just give me an overview of 
what's happening here?

> When DQ.knowChildren is not empty, DQ should not refetch node children in 
> case of single consumer
> -
>
> Key: SOLR-10619
> URL: https://issues.apache.org/jira/browse/SOLR-10619
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
> Attachments: SOLR-10619.patch, SOLR-10619.patch
>
>
> Right now, every time childWatcher is kicked off, we refetch all children 
> of DQ's node. This is wasteful in the case of a single consumer.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10524) Explore in-memory partitioning for processing Overseer queue messages

2017-05-04 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997804#comment-15997804
 ] 

Scott Blum commented on SOLR-10524:
---

Updated patch: DistributedQueue LGTM.

> Explore in-memory partitioning for processing Overseer queue messages
> -
>
> Key: SOLR-10524
> URL: https://issues.apache.org/jira/browse/SOLR-10524
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Erick Erickson
> Attachments: SOLR-10524.patch, SOLR-10524.patch, SOLR-10524.patch
>
>
> There are several JIRAs (I'll link in a second) about trying to be more 
> efficient about processing overseer messages as the overseer can become a 
> bottleneck, especially with very large numbers of replicas in a cluster. One 
> of the approaches mentioned near the end of SOLR-5872 (15-Mar) was to "read 
> large no:of items say 1. put them into in memory buckets and feed them 
> into overseer".
> This JIRA is to break out that part of the discussion as it might be an easy 
> win whereas "eliminating the Overseer queue" would be quite an undertaking.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10524) Explore in-memory partitioning for processing Overseer queue messages

2017-05-04 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997500#comment-15997500
 ] 

Scott Blum commented on SOLR-10524:
---

Couple of thoughts:

1) In the places where you've changed Collection -> List, I would go one step 
further and make it a concrete ArrayList, to a) explicitly convey that the 
returned list is a mutable copy rather than a view of internal state and b) 
explicitly convey that sortAndAdd() is operating efficiently on said lists.

2) DQ.remove(id): don't you need to unconditionally knownChildren.remove(id), 
even if the ZK delete succeeds?

3) DQ.remove(id): there is no need to loop here; in fact, you'll get stuck in an 
infinite loop if someone else deletes the node you're targeting.  The reason 
there's a loop in removeFirst() is because it's trying a different id each 
iteration.

Suggested remove(id) impl:

{code}
  public void remove(String id) throws KeeperException, InterruptedException {
    // Remove the ZK node *first*; ZK will resolve any races with peek()/poll().
    // This is counterintuitive, but peek()/poll() will not return an element if the
    // underlying ZK node has been deleted, so it's okay to update knownChildren afterwards.
    try {
      String path = dir + "/" + id;
      zookeeper.delete(path, -1, true);
    } catch (KeeperException.NoNodeException e) {
      // Another client deleted the node first, this is fine.
    }
    updateLock.lockInterruptibly();
    try {
      knownChildren.remove(id);
    } finally {
      updateLock.unlock();
    }
  }
{code}


> Explore in-memory partitioning for processing Overseer queue messages
> -
>
> Key: SOLR-10524
> URL: https://issues.apache.org/jira/browse/SOLR-10524
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Erick Erickson
> Attachments: SOLR-10524.patch, SOLR-10524.patch
>
>
> There are several JIRAs (I'll link in a second) about trying to be more 
> efficient about processing overseer messages as the overseer can become a 
> bottleneck, especially with very large numbers of replicas in a cluster. One 
> of the approaches mentioned near the end of SOLR-5872 (15-Mar) was to "read 
> large no:of items say 1. put them into in memory buckets and feed them 
> into overseer".
> This JIRA is to break out that part of the discussion as it might be an easy 
> win whereas "eliminating the Overseer queue" would be quite an undertaking.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-10524) Explore in-memory partitioning for processing Overseer queue messages

2017-05-04 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997500#comment-15997500
 ] 

Scott Blum edited comment on SOLR-10524 at 5/4/17 9:56 PM:
---

Couple of thoughts:

1) In the places where you've changed Collection -> List, I would go one step 
further and make it a concrete ArrayList, to a) explicitly convey that the 
returned list is a mutable copy rather than a view of internal state and b) 
explicitly convey that sortAndAdd() is operating efficiently on said lists.

2) DQ.remove(id): don't you want to unconditionally knownChildren.remove(id), 
even if the ZK delete succeeds?

3) DQ.remove(id): there is no need to loop here; in fact, you'll get stuck in an 
infinite loop if someone else deletes the node you're targeting.  The reason 
there's a loop in removeFirst() is because it's trying a different id each 
iteration.

Suggested remove(id) impl:

{code}
  public void remove(String id) throws KeeperException, InterruptedException {
    // Remove the ZK node *first*; ZK will resolve any races with peek()/poll().
    // This is counterintuitive, but peek()/poll() will not return an element if the
    // underlying ZK node has been deleted, so it's okay to update knownChildren afterwards.
    try {
      String path = dir + "/" + id;
      zookeeper.delete(path, -1, true);
    } catch (KeeperException.NoNodeException e) {
      // Another client deleted the node first, this is fine.
    }
    updateLock.lockInterruptibly();
    try {
      knownChildren.remove(id);
    } finally {
      updateLock.unlock();
    }
  }
{code}



was (Author: dragonsinth):
Couple of thoughts:

1) In the places where you've changed Collection -> List, I would go one step 
further and make it a concrete ArrayList, to a) explicitly convey that the 
returned list is a mutable copy rather than a view of internal state and b) 
explicitly convey that sortAndAdd() is operating efficiently on said lists.

2) DQ.remove(id): don't you need to unconditionally knownChildren.remove(id), 
even if the ZK delete succeeds?

3) DQ.remove(id): there is no need to loop here; in fact, you'll get stuck in an 
infinite loop if someone else deletes the node you're targeting.  The reason 
there's a loop in removeFirst() is because it's trying a different id each 
iteration.

Suggested remove(id) impl:

{code}
  public void remove(String id) throws KeeperException, InterruptedException {
    // Remove the ZK node *first*; ZK will resolve any races with peek()/poll().
    // This is counterintuitive, but peek()/poll() will not return an element if the
    // underlying ZK node has been deleted, so it's okay to update knownChildren afterwards.
    try {
      String path = dir + "/" + id;
      zookeeper.delete(path, -1, true);
    } catch (KeeperException.NoNodeException e) {
      // Another client deleted the node first, this is fine.
    }
    updateLock.lockInterruptibly();
    try {
      knownChildren.remove(id);
    } finally {
      updateLock.unlock();
    }
  }
{code}


> Explore in-memory partitioning for processing Overseer queue messages
> -
>
> Key: SOLR-10524
> URL: https://issues.apache.org/jira/browse/SOLR-10524
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Erick Erickson
> Attachments: SOLR-10524.patch, SOLR-10524.patch
>
>
> There are several JIRAs (I'll link in a second) about trying to be more 
> efficient about processing overseer messages as the overseer can become a 
> bottleneck, especially with very large numbers of replicas in a cluster. One 
> of the approaches mentioned near the end of SOLR-5872 (15-Mar) was to "read 
> large no:of items say 1. put them into in memory buckets and feed them 
> into overseer".
> This JIRA is to break out that part of the discussion as it might be an easy 
> win whereas "eliminating the Overseer queue" would be quite an undertaking.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-18 Thread Scott Blum (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Blum resolved SOLR-10420.
---
Resolution: Fixed

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.2, 6.4.2, 6.5
>Reporter: Markus Jelsma
>Assignee: Scott Blum
> Fix For: 5.5.5, 5.6, 6.5.1, 6.6, master (7.0)
>
> Attachments: OverseerTest.106.stdout, OverseerTest.119.stdout, 
> OverseerTest.80.stdout, OverseerTest.DEBUG.43.stdout, 
> OverseerTest.DEBUG.48.stdout, OverseerTest.DEBUG.58.stdout, 
> SOLR-10420-dragonsinth.patch, SOLR-10420.patch, SOLR-10420.patch, 
> SOLR-10420.patch, SOLR-10420.patch, SOLR-10420.patch
>
>
> One of our nodes went berserk after a restart, Solr went completely nuts! 
> So I opened VisualVM to keep an eye on it and spotted a different problem 
> that occurs in all our Solr 6.4.2 and 6.5.0 nodes.
> It appears Solr is leaking one SolrZkClient instance per second via 
> DistributedQueue$ChildWatcher. That one-per-second rate is quite accurate for all 
> nodes; there are about as many instances as there are seconds since Solr 
> started. I know VisualVM's instance count includes objects-to-be-collected, 
> but the instance count does not drop after a forced garbage collection round.
> It doesn't matter how many cores or collections the nodes carry or how heavy 
> the traffic is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-18 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15973346#comment-15973346
 ] 

Scott Blum commented on SOLR-10420:
---

Got it.  I do think it would be a mistake.  In that case, after I've committed 
to 5x and 6x, I'll also commit to 6_5 and 5_5.

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.2, 6.4.2, 6.5
>Reporter: Markus Jelsma
>Assignee: Scott Blum
> Fix For: 5.5.5, 5.6, 6.5.1, 6.6, master (7.0)
>
> Attachments: OverseerTest.106.stdout, OverseerTest.119.stdout, 
> OverseerTest.80.stdout, OverseerTest.DEBUG.43.stdout, 
> OverseerTest.DEBUG.48.stdout, OverseerTest.DEBUG.58.stdout, 
> SOLR-10420-dragonsinth.patch, SOLR-10420.patch, SOLR-10420.patch, 
> SOLR-10420.patch, SOLR-10420.patch, SOLR-10420.patch
>
>
> One of our nodes went berserk after a restart, Solr went completely nuts! 
> So I opened VisualVM to keep an eye on it and spotted a different problem 
> that occurs in all our Solr 6.4.2 and 6.5.0 nodes.
> It appears Solr is leaking one SolrZkClient instance per second via 
> DistributedQueue$ChildWatcher. That one-per-second rate is quite accurate for all 
> nodes; there are about as many instances as there are seconds since Solr 
> started. I know VisualVM's instance count includes objects-to-be-collected, 
> but the instance count does not drop after a forced garbage collection round.
> It doesn't matter how many cores or collections the nodes carry or how heavy 
> the traffic is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-18 Thread Scott Blum (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Blum updated SOLR-10420:
--
Fix Version/s: (was: 6.4.3)

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.2, 6.4.2, 6.5
>Reporter: Markus Jelsma
>Assignee: Scott Blum
> Fix For: 5.5.5, 5.6, 6.5.1, 6.6, master (7.0)
>
> Attachments: OverseerTest.106.stdout, OverseerTest.119.stdout, 
> OverseerTest.80.stdout, OverseerTest.DEBUG.43.stdout, 
> OverseerTest.DEBUG.48.stdout, OverseerTest.DEBUG.58.stdout, 
> SOLR-10420-dragonsinth.patch, SOLR-10420.patch, SOLR-10420.patch, 
> SOLR-10420.patch, SOLR-10420.patch, SOLR-10420.patch
>
>
> One of our nodes went berserk after a restart, Solr went completely nuts! 
> So I opened VisualVM to keep an eye on it and spotted a different problem 
> that occurs in all our Solr 6.4.2 and 6.5.0 nodes.
> It appears Solr is leaking one SolrZkClient instance per second via 
> DistributedQueue$ChildWatcher. That one-per-second rate is quite accurate for all 
> nodes; there are about as many instances as there are seconds since Solr 
> started. I know VisualVM's instance count includes objects-to-be-collected, 
> but the instance count does not drop after a forced garbage collection round.
> It doesn't matter how many cores or collections the nodes carry or how heavy 
> the traffic is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-18 Thread Scott Blum (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Blum updated SOLR-10420:
--
Fix Version/s: 5.6

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.2, 6.4.2, 6.5
>Reporter: Markus Jelsma
>Assignee: Scott Blum
> Fix For: 5.5.5, 5.6, 6.4.3, 6.5.1, 6.6, master (7.0)
>
> Attachments: OverseerTest.106.stdout, OverseerTest.119.stdout, 
> OverseerTest.80.stdout, OverseerTest.DEBUG.43.stdout, 
> OverseerTest.DEBUG.48.stdout, OverseerTest.DEBUG.58.stdout, 
> SOLR-10420-dragonsinth.patch, SOLR-10420.patch, SOLR-10420.patch, 
> SOLR-10420.patch, SOLR-10420.patch, SOLR-10420.patch
>
>
> One of our nodes went berserk after a restart, Solr went completely nuts! 
> So I opened VisualVM to keep an eye on it and spotted a different problem 
> that occurs in all our Solr 6.4.2 and 6.5.0 nodes.
> It appears Solr is leaking one SolrZkClient instance per second via 
> DistributedQueue$ChildWatcher. That one-per-second rate is quite accurate for all 
> nodes; there are about as many instances as there are seconds since Solr 
> started. I know VisualVM's instance count includes objects-to-be-collected, 
> but the instance count does not drop after a forced garbage collection round.
> It doesn't matter how many cores or collections the nodes carry or how heavy 
> the traffic is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-18 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15973320#comment-15973320
 ] 

Scott Blum commented on SOLR-10420:
---

So my plan is to commit this to master, branch_6x, and branch_5x, and let the 
release managers pull it into the actual release branches.  SG?

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.2, 6.4.2, 6.5
>Reporter: Markus Jelsma
>Assignee: Scott Blum
> Fix For: 5.5.5, 6.4.3, 6.5.1, 6.6, master (7.0)
>
> Attachments: OverseerTest.106.stdout, OverseerTest.119.stdout, 
> OverseerTest.80.stdout, OverseerTest.DEBUG.43.stdout, 
> OverseerTest.DEBUG.48.stdout, OverseerTest.DEBUG.58.stdout, 
> SOLR-10420-dragonsinth.patch, SOLR-10420.patch, SOLR-10420.patch, 
> SOLR-10420.patch, SOLR-10420.patch, SOLR-10420.patch
>
>
> One of our nodes went berserk after a restart, Solr went completely nuts! 
> So I opened VisualVM to keep an eye on it and spotted a different problem 
> that occurs in all our Solr 6.4.2 and 6.5.0 nodes.
> It appears Solr is leaking one SolrZkClient instance per second via 
> DistributedQueue$ChildWatcher. That one-per-second rate is quite accurate for all 
> nodes; there are about as many instances as there are seconds since Solr 
> started. I know VisualVM's instance count includes objects-to-be-collected, 
> but the instance count does not drop after a forced garbage collection round.
> It doesn't matter how many cores or collections the nodes carry or how heavy 
> the traffic is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-18 Thread Scott Blum (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Blum updated SOLR-10420:
--
Fix Version/s: 6.6

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.2, 6.4.2, 6.5
>Reporter: Markus Jelsma
>Assignee: Scott Blum
> Fix For: 5.5.5, 6.4.3, 6.5.1, 6.6, master (7.0)
>
> Attachments: OverseerTest.106.stdout, OverseerTest.119.stdout, 
> OverseerTest.80.stdout, OverseerTest.DEBUG.43.stdout, 
> OverseerTest.DEBUG.48.stdout, OverseerTest.DEBUG.58.stdout, 
> SOLR-10420-dragonsinth.patch, SOLR-10420.patch, SOLR-10420.patch, 
> SOLR-10420.patch, SOLR-10420.patch, SOLR-10420.patch
>
>
> One of our nodes went berserk after a restart, Solr went completely nuts! 
> So I opened VisualVM to keep an eye on it and spotted a different problem 
> that occurs in all our Solr 6.4.2 and 6.5.0 nodes.
> It appears Solr is leaking one SolrZkClient instance per second via 
> DistributedQueue$ChildWatcher. That one-per-second rate is quite accurate for all 
> nodes; there are about as many instances as there are seconds since Solr 
> started. I know VisualVM's instance count includes objects-to-be-collected, 
> but the instance count does not drop after a forced garbage collection round.
> It doesn't matter how many cores or collections the nodes carry or how heavy 
> the traffic is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-18 Thread Scott Blum (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Blum updated SOLR-10420:
--
Fix Version/s: master (7.0)
   6.5.1
   6.4.3
   5.5.5

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.2, 6.4.2, 6.5
>Reporter: Markus Jelsma
>Assignee: Scott Blum
> Fix For: 5.5.5, 6.4.3, 6.5.1, master (7.0)
>
> Attachments: OverseerTest.106.stdout, OverseerTest.119.stdout, 
> OverseerTest.80.stdout, OverseerTest.DEBUG.43.stdout, 
> OverseerTest.DEBUG.48.stdout, OverseerTest.DEBUG.58.stdout, 
> SOLR-10420-dragonsinth.patch, SOLR-10420.patch, SOLR-10420.patch, 
> SOLR-10420.patch, SOLR-10420.patch, SOLR-10420.patch
>
>
> One of our nodes went berserk after a restart, Solr went completely nuts! 
> So I opened VisualVM to keep an eye on it and spotted a different problem 
> that occurs in all our Solr 6.4.2 and 6.5.0 nodes.
> It appears Solr is leaking one SolrZkClient instance per second via 
> DistributedQueue$ChildWatcher. That one-per-second rate is quite accurate for all 
> nodes; there are about as many instances as there are seconds since Solr 
> started. I know VisualVM's instance count includes objects-to-be-collected, 
> but the instance count does not drop after a forced garbage collection round.
> It doesn't matter how many cores or collections the nodes carry or how heavy 
> the traffic is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-18 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15973246#comment-15973246
 ] 

Scott Blum commented on SOLR-10420:
---

Great!  Thanks [~steve_rowe]!  I'll get this committed.

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.2, 6.4.2, 6.5
>Reporter: Markus Jelsma
>Assignee: Scott Blum
> Attachments: OverseerTest.106.stdout, OverseerTest.119.stdout, 
> OverseerTest.80.stdout, OverseerTest.DEBUG.43.stdout, 
> OverseerTest.DEBUG.48.stdout, OverseerTest.DEBUG.58.stdout, 
> SOLR-10420-dragonsinth.patch, SOLR-10420.patch, SOLR-10420.patch, 
> SOLR-10420.patch, SOLR-10420.patch, SOLR-10420.patch
>
>
> One of our nodes went berserk after a restart, Solr went completely nuts! 
> So I opened VisualVM to keep an eye on it and spotted a different problem 
> that occurs in all our Solr 6.4.2 and 6.5.0 nodes.
> It appears Solr is leaking one SolrZkClient instance per second via 
> DistributedQueue$ChildWatcher. That one-per-second rate is quite accurate for all 
> nodes; there are about as many instances as there are seconds since Solr 
> started. I know VisualVM's instance count includes objects-to-be-collected, 
> but the instance count does not drop after a forced garbage collection round.
> It doesn't matter how many cores or collections the nodes carry or how heavy 
> the traffic is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-18 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15973093#comment-15973093
 ] 

Scott Blum edited comment on SOLR-10420 at 4/18/17 5:14 PM:


Agreed.  It passes for me.  Anyone on this issue want to do any extensive 
testing of 
https://issues.apache.org/jira/secure/attachment/12863715/SOLR-10420-dragonsinth.patch
 before I commit?  Otherwise I'll commit this today to master and then start 
backporting it to a number of branches.


was (Author: dragonsinth):
Agreed.  It passes for me.  Anyone on this issue want to do any extensive 
testing before I commit?  Otherwise I'll commit this today to master and then 
start backporting it to a number of branches.

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.2, 6.4.2, 6.5
>Reporter: Markus Jelsma
>Assignee: Scott Blum
> Attachments: OverseerTest.106.stdout, OverseerTest.119.stdout, 
> OverseerTest.80.stdout, OverseerTest.DEBUG.43.stdout, 
> OverseerTest.DEBUG.48.stdout, OverseerTest.DEBUG.58.stdout, 
> SOLR-10420-dragonsinth.patch, SOLR-10420.patch, SOLR-10420.patch, 
> SOLR-10420.patch, SOLR-10420.patch, SOLR-10420.patch
>
>
> One of our nodes went berserk after a restart, Solr went completely nuts! 
> So I opened VisualVM to keep an eye on it and spotted a different problem 
> that occurs in all our Solr 6.4.2 and 6.5.0 nodes.
> It appears Solr is leaking one SolrZkClient instance per second via 
> DistributedQueue$ChildWatcher. That one-per-second rate is quite accurate for all 
> nodes; there are about as many instances as there are seconds since Solr 
> started. I know VisualVM's instance count includes objects-to-be-collected, 
> but the instance count does not drop after a forced garbage collection round.
> It doesn't matter how many cores or collections the nodes carry or how heavy 
> the traffic is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-18 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15973093#comment-15973093
 ] 

Scott Blum commented on SOLR-10420:
---

Agreed.  It passes for me.  Anyone on this issue want to do any extensive 
testing before I commit?  Otherwise I'll commit this today to master and then 
start backporting it to a number of branches.

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.2, 6.4.2, 6.5
>Reporter: Markus Jelsma
>Assignee: Scott Blum
> Attachments: OverseerTest.106.stdout, OverseerTest.119.stdout, 
> OverseerTest.80.stdout, OverseerTest.DEBUG.43.stdout, 
> OverseerTest.DEBUG.48.stdout, OverseerTest.DEBUG.58.stdout, 
> SOLR-10420-dragonsinth.patch, SOLR-10420.patch, SOLR-10420.patch, 
> SOLR-10420.patch, SOLR-10420.patch, SOLR-10420.patch
>
>
> One of our nodes went berserk after a restart, Solr went completely nuts! 
> So I opened VisualVM to keep an eye on it and spotted a different problem 
> that occurs in all our Solr 6.4.2 and 6.5.0 nodes.
> It appears Solr is leaking one SolrZkClient instance per second via 
> DistributedQueue$ChildWatcher. That one-per-second rate is quite accurate for all 
> nodes; there are about as many instances as there are seconds since Solr 
> started. I know VisualVM's instance count includes objects-to-be-collected, 
> but the instance count does not drop after a forced garbage collection round.
> It doesn't matter how many cores or collections the nodes carry or how heavy 
> the traffic is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-17 Thread Scott Blum (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Blum updated SOLR-10420:
--
Attachment: (was: SOLR-10420-dragonsinth.patch)

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.2, 6.4.2, 6.5
>Reporter: Markus Jelsma
>Assignee: Scott Blum
> Attachments: OverseerTest.106.stdout, OverseerTest.119.stdout, 
> OverseerTest.80.stdout, OverseerTest.DEBUG.43.stdout, 
> OverseerTest.DEBUG.48.stdout, OverseerTest.DEBUG.58.stdout, 
> SOLR-10420-dragonsinth.patch, SOLR-10420.patch, SOLR-10420.patch, 
> SOLR-10420.patch, SOLR-10420.patch, SOLR-10420.patch
>
>
> One of our nodes went berserk after a restart, Solr went completely nuts! 
> So I opened VisualVM to keep an eye on it and spotted a different problem 
> that occurs in all our Solr 6.4.2 and 6.5.0 nodes.
> It appears Solr is leaking one SolrZkClient instance per second via 
> DistributedQueue$ChildWatcher. That one-per-second rate is quite accurate for all 
> nodes; there are about as many instances as there are seconds since Solr 
> started. I know VisualVM's instance count includes objects-to-be-collected, 
> but the instance count does not drop after a forced garbage collection round.
> It doesn't matter how many cores or collections the nodes carry or how heavy 
> the traffic is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-17 Thread Scott Blum (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Blum updated SOLR-10420:
--
Attachment: SOLR-10420-dragonsinth.patch

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.2, 6.4.2, 6.5
>Reporter: Markus Jelsma
>Assignee: Scott Blum
> Attachments: OverseerTest.106.stdout, OverseerTest.119.stdout, 
> OverseerTest.80.stdout, OverseerTest.DEBUG.43.stdout, 
> OverseerTest.DEBUG.48.stdout, OverseerTest.DEBUG.58.stdout, 
> SOLR-10420-dragonsinth.patch, SOLR-10420.patch, SOLR-10420.patch, 
> SOLR-10420.patch, SOLR-10420.patch, SOLR-10420.patch
>
>
> One of our nodes went berserk after a restart, Solr went completely nuts! 
> So I opened VisualVM to keep an eye on it and spotted a different problem 
> that occurs in all our Solr 6.4.2 and 6.5.0 nodes.
> It appears Solr is leaking one SolrZkClient instance per second via 
> DistributedQueue$ChildWatcher. That one-per-second rate is quite accurate for all 
> nodes; there are about as many instances as there are seconds since Solr 
> started. I know VisualVM's instance count includes objects-to-be-collected, 
> but the instance count does not drop after a forced garbage collection round.
> It doesn't matter how many cores or collections the nodes carry or how heavy 
> the traffic is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Assigned] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-17 Thread Scott Blum (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Blum reassigned SOLR-10420:
-

Assignee: Scott Blum

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.2, 6.4.2, 6.5
>Reporter: Markus Jelsma
>Assignee: Scott Blum
> Attachments: OverseerTest.106.stdout, OverseerTest.119.stdout, 
> OverseerTest.80.stdout, OverseerTest.DEBUG.43.stdout, 
> OverseerTest.DEBUG.48.stdout, OverseerTest.DEBUG.58.stdout, 
> SOLR-10420-dragonsinth.patch, SOLR-10420.patch, SOLR-10420.patch, 
> SOLR-10420.patch, SOLR-10420.patch, SOLR-10420.patch
>
>
> One of our nodes went berserk after a restart, Solr went completely nuts! 
> So I opened VisualVM to keep an eye on it and spotted a different problem 
> that occurs in all our Solr 6.4.2 and 6.5.0 nodes.
> It appears Solr is leaking one SolrZkClient instance per second via 
> DistributedQueue$ChildWatcher. That one-per-second rate is quite accurate for all 
> nodes; there are about as many instances as there are seconds since Solr 
> started. I know VisualVM's instance count includes objects-to-be-collected, 
> but the instance count does not drop after a forced garbage collection round.
> It doesn't matter how many cores or collections the nodes carry or how heavy 
> the traffic is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-17 Thread Scott Blum (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Blum updated SOLR-10420:
--
Attachment: SOLR-10420-dragonsinth.patch

[~caomanhdat] [~jhump] I think this may be the right approach after reviewing 
the overall design.  I don't see any real reason to specifically track 
lastWatcher; we just need to ensure that no more than one is ever set.  And 
having lastWatcher serve double-duty was a misdesign on my part.  There are 
really two separate stateful questions to answer:

1) Is there a watcher set?
2) Are we known to be dirty?

The answer to those two questions is not the same if we want same-thread 
synchronous offer -> poll to work as you would expect, so this patch tracks 
them separately.
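
In code form, the separation is roughly this (field names are assumed, not 
necessarily the ones in the patch):

{code}
// Sketch: track the two conditions independently so that a same-thread offer()
// followed by poll() can observe the new element without waiting for a watcher.
private boolean watcherSet = false;        // is a (single) child watcher registered?
private boolean childrenMayBeStale = true; // might the cached child list be out of date?
{code}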

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.2, 6.4.2, 6.5
>Reporter: Markus Jelsma
> Attachments: OverseerTest.106.stdout, OverseerTest.119.stdout, 
> OverseerTest.80.stdout, OverseerTest.DEBUG.43.stdout, 
> OverseerTest.DEBUG.48.stdout, OverseerTest.DEBUG.58.stdout, 
> SOLR-10420-dragonsinth.patch, SOLR-10420.patch, SOLR-10420.patch, 
> SOLR-10420.patch, SOLR-10420.patch, SOLR-10420.patch
>
>
> One of our nodes went berserk after a restart; Solr went completely nuts! 
> So I opened VisualVM to keep an eye on it and spotted a different problem 
> that occurs in all our Solr 6.4.2 and 6.5.0 nodes.
> It appears Solr is leaking one SolrZkClient instance per second via 
> DistributedQueue$ChildWatcher. That one-per-second rate is quite accurate for all 
> nodes; there are about the same number of instances as there are seconds 
> since Solr started. I know VisualVM's instance count includes 
> objects-to-be-collected, but the instance count does not drop after a forced 
> garbage collection round.
> It doesn't matter how many cores or collections the nodes carry or how heavy 
> traffic is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-17 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971259#comment-15971259
 ] 

Scott Blum commented on SOLR-10420:
---

[~caomanhdat] I didn't literally mean that we should bring back the isDirty 
bit.  I meant that clearly, the last time around, there was a hole in the design 
that led to this leak.  I want to take the opportunity to re-examine the design 
as a whole and make sure everything is sound, and that we're not just putting a 
band-aid on it.  You may have already done this, so just give me a little bit 
to catch up. :D


> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.2, 6.4.2, 6.5
>Reporter: Markus Jelsma
> Attachments: OverseerTest.106.stdout, OverseerTest.119.stdout, 
> OverseerTest.80.stdout, OverseerTest.DEBUG.43.stdout, 
> OverseerTest.DEBUG.48.stdout, OverseerTest.DEBUG.58.stdout, SOLR-10420.patch, 
> SOLR-10420.patch, SOLR-10420.patch, SOLR-10420.patch, SOLR-10420.patch
>
>
> One of our nodes went berserk after a restart; Solr went completely nuts! 
> So I opened VisualVM to keep an eye on it and spotted a different problem 
> that occurs in all our Solr 6.4.2 and 6.5.0 nodes.
> It appears Solr is leaking one SolrZkClient instance per second via 
> DistributedQueue$ChildWatcher. That one-per-second rate is quite accurate for all 
> nodes; there are about the same number of instances as there are seconds 
> since Solr started. I know VisualVM's instance count includes 
> objects-to-be-collected, but the instance count does not drop after a forced 
> garbage collection round.
> It doesn't matter how many cores or collections the nodes carry or how heavy 
> traffic is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-15 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15970030#comment-15970030
 ] 

Scott Blum commented on SOLR-10420:
---

Let me try to unpack what you said...

1) We want a synchronous offer() -> peek() on the same thread to return the 
item offered without delay.
2) This works on master, but the original patch to fix the leak breaks #1.

Is that correct?

Let me look at this on Monday with [~jhump].  I'm pretty sure there's a 
simplification to be made in DQ with how we're handling the watcher and dirty 
tracking.  There used to be an explicit "isDirty" bit that we traded out for 
watcher nullability, which in retrospect I'm not sure was the best choice.
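
To pin down requirement #1, here is a rough sketch of the behavior we want to 
preserve.  The exact DistributedQueue constructor and method signatures below 
are an assumption on my part; the point is only the same-thread semantics:

{code}
// Illustrative only: treat the DistributedQueue constructor/method signatures
// as assumptions. What matters is the same-thread offer -> peek behavior.
import java.nio.charset.StandardCharsets;
import org.apache.solr.cloud.DistributedQueue;
import org.apache.solr.common.cloud.SolrZkClient;

class SameThreadOfferPeekSketch {
  static void check(SolrZkClient zkClient) throws Exception {
    DistributedQueue dq = new DistributedQueue(zkClient, "/overseer/queue-work");

    dq.offer("some-task".getBytes(StandardCharsets.UTF_8));  // enqueue on this thread

    // Requirement #1: a peek() on the same thread should see the element we just
    // offered immediately, without waiting for the ZK watcher to fire.
    byte[] head = dq.peek();
    if (head == null) {
      throw new AssertionError("same-thread offer -> peek should not miss the new element");
    }
  }
}
{code}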

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.2, 6.4.2, 6.5
>Reporter: Markus Jelsma
> Attachments: OverseerTest.106.stdout, OverseerTest.119.stdout, 
> OverseerTest.80.stdout, OverseerTest.DEBUG.43.stdout, 
> OverseerTest.DEBUG.48.stdout, OverseerTest.DEBUG.58.stdout, SOLR-10420.patch, 
> SOLR-10420.patch, SOLR-10420.patch, SOLR-10420.patch
>
>
> One of our nodes went berserk after a restart; Solr went completely nuts! 
> So I opened VisualVM to keep an eye on it and spotted a different problem 
> that occurs in all our Solr 6.4.2 and 6.5.0 nodes.
> It appears Solr is leaking one SolrZkClient instance per second via 
> DistributedQueue$ChildWatcher. That one-per-second rate is quite accurate for all 
> nodes; there are about the same number of instances as there are seconds 
> since Solr started. I know VisualVM's instance count includes 
> objects-to-be-collected, but the instance count does not drop after a forced 
> garbage collection round.
> It doesn't matter how many cores or collections the nodes carry or how heavy 
> traffic is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-13 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968618#comment-15968618
 ] 

Scott Blum commented on SOLR-10420:
---

Fix LGTM.

Is the actual fix this?

{code}
  // we're not in a dirty state, and we do not have in-memory children
  if (lastWatcher != null) return null;
{code}

I.e., if you just do that, would that fix the leak even without reusing the same 
watcher object?


> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.2, 6.5, 6.4.2
>Reporter: Markus Jelsma
> Attachments: OverseerTest.106.stdout, OverseerTest.119.stdout, 
> OverseerTest.80.stdout, OverseerTest.DEBUG.43.stdout, 
> OverseerTest.DEBUG.48.stdout, OverseerTest.DEBUG.58.stdout, SOLR-10420.patch, 
> SOLR-10420.patch, SOLR-10420.patch
>
>
> One of our nodes went berserk after a restart; Solr went completely nuts! 
> So I opened VisualVM to keep an eye on it and spotted a different problem 
> that occurs in all our Solr 6.4.2 and 6.5.0 nodes.
> It appears Solr is leaking one SolrZkClient instance per second via 
> DistributedQueue$ChildWatcher. That one-per-second rate is quite accurate for all 
> nodes; there are about the same number of instances as there are seconds 
> since Solr started. I know VisualVM's instance count includes 
> objects-to-be-collected, but the instance count does not drop after a forced 
> garbage collection round.
> It doesn't matter how many cores or collections the nodes carry or how heavy 
> traffic is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10277) On 'downnode', lots of wasteful mutations are done to ZK

2017-04-05 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15957285#comment-15957285
 ] 

Scott Blum commented on SOLR-10277:
---

Thank you Shalin!

> On 'downnode', lots of wasteful mutations are done to ZK
> 
>
> Key: SOLR-10277
> URL: https://issues.apache.org/jira/browse/SOLR-10277
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 5.5.3, 5.5.4, 6.0.1, 6.2.1, 6.3, 6.4.2
>Reporter: Joshua Humphries
>Assignee: Scott Blum
>  Labels: leader, zookeeper
> Fix For: master (7.0), 6.5.1
>
> Attachments: SOLR-10277-5.5.3.patch, SOLR-10277.patch, 
> SOLR-10277.patch
>
>
> When a node restarts, it submits a single 'downnode' message to the 
> overseer's state update queue.
> When the overseer processes the message, it does way more writes to ZK than 
> necessary. In our cluster of 48 hosts, the majority of collections have only 
> 1 shard and 1 replica. So a single node restarting should only result in 
> ~1/40th of the collections being updated with new replica states (to indicate 
> the node that is no longer active).
> However, the current logic in NodeMutator#downNode always updates *every* 
> collection. So we end up having to do rolling restarts very slowly to avoid 
> having a severe outage due to the overseer having to do way too much work for 
> each host that is restarted. And subsequent shards becoming leader can't get 
> processed until the `downnode` message is fully processed. So a fast rolling 
> restart can result in the overseer queue growing incredibly large and nearly 
> all shards winding up in a leader-less state until that backlog is processed.
> The fix is a trivial logic change to only add a ZkWriteCommand for 
> collections that actually have an impacted replica.
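
For readers who don't know this code path, the fix described above amounts to 
something like the following sketch.  It is simplified and illustrative, not 
the literal NodeMutator#downNode change:

{code}
// Simplified sketch of the SOLR-10277 idea: find the collections that actually
// host a replica on the downed node, and only queue a ZkWriteCommand for those,
// instead of for every collection in the cluster. Not the literal NodeMutator code.
Set<String> impactedCollections(ClusterState clusterState, String downNodeName) {
  Set<String> impacted = new HashSet<>();
  for (Map.Entry<String, DocCollection> entry : clusterState.getCollectionsMap().entrySet()) {
    for (Slice slice : entry.getValue().getSlices()) {
      for (Replica replica : slice.getReplicas()) {
        if (downNodeName.equals(replica.getNodeName())
            && replica.getState() != Replica.State.DOWN) {
          impacted.add(entry.getKey());  // this collection still has a replica to mark DOWN
        }
      }
    }
  }
  return impacted;  // the overseer writes updated state only for these collections
}
{code}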



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-04 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15955283#comment-15955283
 ] 

Scott Blum commented on SOLR-10420:
---

Hard to see how the problem could be localized to 
DistributedQueue$ChildWatcher... it doesn't create any ZkClients; the client is 
passed in from the outside.

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 6.5, 6.4.2
>Reporter: Markus Jelsma
> Fix For: master (7.0), branch_6x
>
>
> One of our nodes went berserk after a restart; Solr went completely nuts! 
> So I opened VisualVM to keep an eye on it and spotted a different problem 
> that occurs in all our Solr 6.4.2 and 6.5.0 nodes.
> It appears Solr is leaking one SolrZkClient instance per second via 
> DistributedQueue$ChildWatcher. That one-per-second rate is quite accurate for all 
> nodes; there are about the same number of instances as there are seconds 
> since Solr started. I know VisualVM's instance count includes 
> objects-to-be-collected, but the instance count does not drop after a forced 
> garbage collection round.
> It doesn't matter how many cores or collections the nodes carry or how heavy 
> traffic is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10277) On 'downnode', lots of wasteful mutations are done to ZK

2017-04-04 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15955274#comment-15955274
 ] 

Scott Blum commented on SOLR-10277:
---

Agreed, [~shalinmangar].  I'm actually OOO all week; if you wanted to take point 
on getting this landed, that would be super.  I reviewed all the live code 
previously, but not [~varunthacker]'s patch to the test (though to be honest 
I'm not super familiar with the test frameworks anyway).

> On 'downnode', lots of wasteful mutations are done to ZK
> 
>
> Key: SOLR-10277
> URL: https://issues.apache.org/jira/browse/SOLR-10277
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 5.5.3, 5.5.4, 6.0.1, 6.2.1, 6.3, 6.4.2
>Reporter: Joshua Humphries
>Assignee: Scott Blum
>  Labels: leader, zookeeper
> Attachments: SOLR-10277-5.5.3.patch, SOLR-10277.patch
>
>
> When a node restarts, it submits a single 'downnode' message to the 
> overseer's state update queue.
> When the overseer processes the message, it does way more writes to ZK than 
> necessary. In our cluster of 48 hosts, the majority of collections have only 
> 1 shard and 1 replica. So a single node restarting should only result in 
> ~1/40th of the collections being updated with new replica states (to indicate 
> the node that is no longer active).
> However, the current logic in NodeMutator#downNode always updates *every* 
> collection. So we end up having to do rolling restarts very slowly to avoid 
> having a severe outage due to the overseer having to do way too much work for 
> each host that is restarted. And subsequent shards becoming leader can't get 
> processed until the `downnode` message is fully processed. So a fast rolling 
> restart can result in the overseer queue growing incredibly large and nearly 
> all shards winding up in a leader-less state until that backlog is processed.
> The fix is a trivial logic change to only add a ZkWriteCommand for 
> collections that actually have an impacted replica.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10277) On 'downnode', lots of wasteful mutations are done to ZK

2017-03-29 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948316#comment-15948316
 ] 

Scott Blum commented on SOLR-10277:
---

Sure thing, no rush!  Would love to get more eyes + miles on it.

> On 'downnode', lots of wasteful mutations are done to ZK
> 
>
> Key: SOLR-10277
> URL: https://issues.apache.org/jira/browse/SOLR-10277
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 5.5.3, 5.5.4, 6.0.1, 6.2.1, 6.3, 6.4.2
>Reporter: Joshua Humphries
>Assignee: Scott Blum
>  Labels: leader, zookeeper
> Attachments: SOLR-10277-5.5.3.patch
>
>
> When a node restarts, it submits a single 'downnode' message to the 
> overseer's state update queue.
> When the overseer processes the message, it does way more writes to ZK than 
> necessary. In our cluster of 48 hosts, the majority of collections have only 
> 1 shard and 1 replica. So a single node restarting should only result in 
> ~1/40th of the collections being updated with new replica states (to indicate 
> the node that is no longer active).
> However, the current logic in NodeMutator#downNode always updates *every* 
> collection. So we end up having to do rolling restarts very slowly to avoid 
> having a severe outage due to the overseer having to do way too much work for 
> each host that is restarted. And subsequent shards becoming leader can't get 
> processed until the `downnode` message is fully processed. So a fast rolling 
> restart can result in the overseer queue growing incredibly large and nearly 
> all shards winding up in a leader-less state until that backlog is processed.
> The fix is a trivial logic change to only add a ZkWriteCommand for 
> collections that actually have an impacted replica.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Assigned] (SOLR-10277) On 'downnode', lots of wasteful mutations are done to ZK

2017-03-29 Thread Scott Blum (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Blum reassigned SOLR-10277:
-

Assignee: Scott Blum

> On 'downnode', lots of wasteful mutations are done to ZK
> 
>
> Key: SOLR-10277
> URL: https://issues.apache.org/jira/browse/SOLR-10277
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 5.5.3, 5.5.4, 6.0.1, 6.2.1, 6.3, 6.4.2
>Reporter: Joshua Humphries
>Assignee: Scott Blum
>  Labels: leader, zookeeper
> Attachments: SOLR-10277-5.5.3.patch
>
>
> When a node restarts, it submits a single 'downnode' message to the 
> overseer's state update queue.
> When the overseer processes the message, it does way more writes to ZK than 
> necessary. In our cluster of 48 hosts, the majority of collections have only 
> 1 shard and 1 replica. So a single node restarting should only result in 
> ~1/40th of the collections being updated with new replica states (to indicate 
> the node that is no longer active).
> However, the current logic in NodeMutator#downNode always updates *every* 
> collection. So we end up having to do rolling restarts very slowly to avoid 
> having a severe outage due to the overseer having to do way too much work for 
> each host that is restarted. And subsequent shards becoming leader can't get 
> processed until the `downnode` message is fully processed. So a fast rolling 
> restart can result in the overseer queue growing incredibly large and nearly 
> all shards winding up in a leader-less state until that backlog is processed.
> The fix is a trivial logic change to only add a ZkWriteCommand for 
> collections that actually have an impacted replica.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10277) On 'downnode', lots of wasteful mutations are done to ZK

2017-03-29 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948297#comment-15948297
 ] 

Scott Blum commented on SOLR-10277:
---

Patch LGTM, any objections to merging this?

> On 'downnode', lots of wasteful mutations are done to ZK
> 
>
> Key: SOLR-10277
> URL: https://issues.apache.org/jira/browse/SOLR-10277
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 5.5.3, 5.5.4, 6.0.1, 6.2.1, 6.3, 6.4.2
>Reporter: Joshua Humphries
>Assignee: Scott Blum
>  Labels: leader, zookeeper
> Attachments: SOLR-10277-5.5.3.patch
>
>
> When a node restarts, it submits a single 'downnode' message to the 
> overseer's state update queue.
> When the overseer processes the message, it does way more writes to ZK than 
> necessary. In our cluster of 48 hosts, the majority of collections have only 
> 1 shard and 1 replica. So a single node restarting should only result in 
> ~1/40th of the collections being updated with new replica states (to indicate 
> the node that is no longer active).
> However, the current logic in NodeMutator#downNode always updates *every* 
> collection. So we end up having to do rolling restarts very slowly to avoid 
> having a severe outage due to the overseer having to do way too much work for 
> each host that is restarted. And subsequent shards becoming leader can't get 
> processed until the `downnode` message is fully processed. So a fast rolling 
> restart can result in the overseer queue growing incredibly large and nearly 
> all shards winding up in a leader-less state until that backlog is processed.
> The fix is a trivial logic change to only add a ZkWriteCommand for 
> collections that actually have an impacted replica.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-10277) On 'downnode', lots of wasteful mutations are done to ZK

2017-03-14 Thread Scott Blum (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-10277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Blum updated SOLR-10277:
--
Affects Version/s: 5.5.4
   6.0.1
   6.2.1
   6.3
   6.4.2

> On 'downnode', lots of wasteful mutations are done to ZK
> 
>
> Key: SOLR-10277
> URL: https://issues.apache.org/jira/browse/SOLR-10277
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 5.5.3, 5.5.4, 6.0.1, 6.2.1, 6.3, 6.4.2
>Reporter: Joshua Humphries
>  Labels: leader, zookeeper
>
> When a node restarts, it submits a single 'downnode' message to the 
> overseer's state update queue.
> When the overseer processes the message, it does way more writes to ZK than 
> necessary. In our cluster of 48 hosts, the majority of collections have only 
> 1 shard and 1 replica. So a single node restarting should only result in 
> ~1/40th of the collections being updated with new replica states (to indicate 
> the node that is no longer active).
> However, the current logic in NodeMutator#downNode always updates *every* 
> collection. So we end up having to do rolling restarts very slowly to avoid 
> having a severe outage due to the overseer having to do way too much work for 
> each host that is restarted. And subsequent shards becoming leader can't get 
> processed until the `downnode` message is fully processed. So a fast rolling 
> restart can result in the overseer queue growing incredibly large and nearly 
> all shards winding up in a leader-less state until that backlog is processed.
> The fix is a trivial logic change to only add a ZkWriteCommand for 
> collections that actually have an impacted replica.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9811) Make it easier to manually execute overseer commands

2016-12-02 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15716043#comment-15716043
 ] 

Scott Blum commented on SOLR-9811:
--

I'm not sure, but it might have something to do with race conditions when 
"moving" a replica.

An operation we do a lot of is: create a new replica on a new machine, wait for 
it to become active, then delete the old replica.  It's possible that this 
process is what sometimes leaves us with a single replica marked both "DOWN" 
and "LEADER".

> Make it easier to manually execute overseer commands
> 
>
> Key: SOLR-9811
> URL: https://issues.apache.org/jira/browse/SOLR-9811
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Mike Drob
>
> Sometimes solrcloud will get into a bad state w.r.t. election or recovery and 
> it would be useful to have the ability to manually publish a node as active 
> or leader. This would be an alternative to some current ops practices of 
> restarting services, which may take a while to complete given many cores 
> hosted on a single server.
> This is an expert operator technique and readers should be made aware of 
> this, a.k.a. the "I don't care, just get it running" approach.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9811) Make it easier to manually execute overseer commands

2016-12-02 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15715840#comment-15715840
 ] 

Scott Blum commented on SOLR-9811:
--

The replica is marked both LEADER and DOWN.

Basically, I can't FORCELEADER because the replica isn't active, and I can't 
force recovery because the replica is already leader.
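
(For reference, the FORCELEADER being discussed is the Collections API action; 
the names below are placeholders.)

{code}
// Illustrative FORCELEADER call (placeholder host/collection/shard). As noted
// above, it doesn't help here because the lone replica isn't ACTIVE.
String forceLeader =
    "http://host:8983/solr/admin/collections?action=FORCELEADER&collection=myColl&shard=shard1";
{code}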

> Make it easier to manually execute overseer commands
> 
>
> Key: SOLR-9811
> URL: https://issues.apache.org/jira/browse/SOLR-9811
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Mike Drob
>
> Sometimes solrcloud will get into a bad state w.r.t. election or recovery and 
> it would be useful to have the ability to manually publish a node as active 
> or leader. This would be an alternative to some current ops practices of 
> restarting services, which may take a while to complete given many cores 
> hosted on a single server.
> This is an expert operator technique and readers should be made aware of 
> this, a.k.a. the "I don't care, just get it running" approach.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9811) Make it easier to manually execute overseer commands

2016-12-02 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15715835#comment-15715835
 ] 

Scott Blum commented on SOLR-9811:
--

REQUESTRECOVERY did not work:

{code}
2016-12-02 18:03:02.611 ERROR (recoveryExecutor-3-thread-10-processing-n:10.240.0.69:8983_solr x:24VFQ_shard1_replica0 s:shard1 c:24VFQ r:core_node1) [c:24VFQ s:shard1 r:core_node1 x:24VFQ_shard1_replica0] o.a.s.c.RecoveryStrategy Error while trying to recover. core=24VFQ_shard1_replica0:org.apache.solr.common.SolrException: Cloud state still says we are leader.
	at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:320)
{code}
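
For reference, the recovery was requested through the CoreAdmin REQUESTRECOVERY 
action, roughly as below.  The host and core name come from the log above; the 
exact call issued is my reconstruction:

{code}
// Reconstruction of the CoreAdmin call whose failure is logged above; host and
// core name are taken from the log line, the rest is illustrative.
String requestRecovery =
    "http://10.240.0.69:8983/solr/admin/cores?action=REQUESTRECOVERY&core=24VFQ_shard1_replica0";
{code}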

> Make it easier to manually execute overseer commands
> 
>
> Key: SOLR-9811
> URL: https://issues.apache.org/jira/browse/SOLR-9811
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Mike Drob
>
> Sometimes solrcloud will get into a bad state w.r.t. election or recovery and 
> it would be useful to have the ability to manually publish a node as active 
> or leader. This would be an alternative to some current ops practices of 
> restarting services, which may take a while to complete given many cores 
> hosted on a single server.
> This is an expert operator technique and readers should be made aware of 
> this, a.k.a. the "I don't care, just get it running" approach.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-7282) Cache config or index schema objects by configset and share them across cores

2016-11-30 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709534#comment-15709534
 ] 

Scott Blum commented on SOLR-7282:
--

Thanks Kevin, we'll use that.

> Cache config or index schema objects by configset and share them across cores
> -
>
> Key: SOLR-7282
> URL: https://issues.apache.org/jira/browse/SOLR-7282
> Project: Solr
>  Issue Type: Sub-task
>  Components: SolrCloud
>Reporter: Shalin Shekhar Mangar
>Assignee: Noble Paul
> Fix For: 5.2, 6.0
>
> Attachments: SOLR-7282.patch
>
>
> Sharing schema and config objects has been known to improve startup 
> performance when a large number of cores are on the same box (See 
> http://wiki.apache.org/solr/LotsOfCores). Damien also saw improvements to 
> cluster startup speed upon caching the index schema in SOLR-7191.
> Now that SolrCloud configuration is based on config sets in ZK, we should 
> explore how we can minimize config/schema parsing for each core in a way that 
> is compatible with the recent/planned changes in the config and schema APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-7282) Cache config or index schema objects by configset and share them across cores

2016-11-30 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709533#comment-15709533
 ] 

Scott Blum commented on SOLR-7282:
--

OH, it was just added in 6.2

https://issues.apache.org/jira/browse/SOLR-9216

> Cache config or index schema objects by configset and share them across cores
> -
>
> Key: SOLR-7282
> URL: https://issues.apache.org/jira/browse/SOLR-7282
> Project: Solr
>  Issue Type: Sub-task
>  Components: SolrCloud
>Reporter: Shalin Shekhar Mangar
>Assignee: Noble Paul
> Fix For: 5.2, 6.0
>
> Attachments: SOLR-7282.patch
>
>
> Sharing schema and config objects has been known to improve startup 
> performance when a large number of cores are on the same box (See 
> http://wiki.apache.org/solr/LotsOfCores). Damien also saw improvements to 
> cluster startup speed upon caching the index schema in SOLR-7191.
> Now that SolrCloud configuration is based on config sets in ZK, we should 
> explore how we can minimize config/schema parsing for each core in a way that 
> is compatible with the recent/planned changes in the config and schema APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9811) Make it easier to manually execute overseer commands

2016-11-30 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709464#comment-15709464
 ] 

Scott Blum commented on SOLR-9811:
--

Yeah I'll give that a shot next time, didn't know it existed before today.

> Make it easier to manually execute overseer commands
> 
>
> Key: SOLR-9811
> URL: https://issues.apache.org/jira/browse/SOLR-9811
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Mike Drob
>
> Sometimes solrcloud will get into a bad state w.r.t. election or recovery and 
> it would be useful to have the ability to manually publish a node as active 
> or leader. This would be an alternative to some current ops practices of 
> restarting services, which may take a while to complete given many cores 
> hosted on a single server.
> This is an expert operator technique and readers should be made aware of 
> this, a.k.a. the "I don't care, just get it running" approach.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9811) Make it easier to manually execute overseer commands

2016-11-30 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709449#comment-15709449
 ] 

Scott Blum commented on SOLR-9811:
--

Seems fine to me.  I was mostly posting what I'd done for [~mdrob], who needs to 
do something similar.  I've tried FORCELEADER a few times, but for me it never 
puts a replica erroneously marked as DOWN into an ACTIVE state.

> Make it easier to manually execute overseer commands
> 
>
> Key: SOLR-9811
> URL: https://issues.apache.org/jira/browse/SOLR-9811
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Reporter: Mike Drob
>
> Sometimes solrcloud will get into a bad state w.r.t. election or recovery and 
> it would be useful to have the ability to manually publish a node as active 
> or leader. This would be an alternative to some current ops practices of 
> restarting services, which may take a while to complete given many cores 
> hosted on a single server.
> This is an expert operator technique and readers should be made aware of 
> this, a.k.a. the "I don't care, just get it running" approach.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


