[ https://issues.apache.org/jira/browse/SOLR-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15931532#comment-15931532 ]

Shawn Heisey commented on SOLR-10265:
-------------------------------------

bq. We should seriously reconsider the idea of using ZK queue for the overseer messages.

Serious question, not being confrontational: what are the characteristics of 
ZooKeeper that make it a bad choice for the overseer queue?  My suspicion is 
that it's problematic because of the number of messages in a typical queue, not 
because of technology issues.  If a way can be found to accomplish the 
overseer's job with a very small number of messages instead of a very large 
number, would ZK be suitable?

Couldn't we create more queue entry types that accomplish with a single entry 
what currently can take hundreds or thousands of entries (depending on the 
number of collections)?  I can envision types that indicate "node X just came 
up" and "node X just went down" ... the overseer should be able to use the 
cluster information in ZK to decide what state changes are required, rather 
than processing an individual message for each state change.  Other types that 
I haven't thought of would probably be needed.  Another type might specify, 
with a single entry, state changes that affect multiple collections or shard 
replicas.
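
As a rough sketch of what I mean (plain Java with made-up names, not existing 
Solr code), a single "node X just went down" entry could be expanded by the 
overseer against the cluster information it already reads from ZK, instead of 
one queue entry being published per replica:

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical coarse-grained overseer message: one queue entry that says
// "this node went down", no matter how many replicas the node hosts.
public class NodeDownMessage {
  private final String nodeName;

  public NodeDownMessage(String nodeName) {
    this.nodeName = nodeName;
  }

  // The overseer turns the single entry into per-replica state changes by
  // consulting cluster information it already has, rather than receiving a
  // separate queue entry for every replica on the node.
  public List<String> toReplicaStateChanges(Map<String, List<String>> replicasByNode) {
    List<String> changes = new ArrayList<>();
    for (String core : replicasByNode.getOrDefault(nodeName, Collections.emptyList())) {
      changes.add(core + " -> DOWN");
    }
    return changes;
  }
}
{code}

The same pattern would apply to "node X just came up" and to entries that 
cover multiple collections or shard replicas at once.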

When a collection is created, is there REALLY a need to send state changes for 
EVERY collection in the cloud?  That seems completely unnecessary to me.  I 
have not attempted other actions like creating a new replica, but I would not 
be surprised to learn that they also send many unnecessary state changes.

If ZK isn't used for the overseer queue, then some other method must be found 
to make sure the queue is not lost when nodes go down.  An in-memory queue 
would need to be somehow replicated to at least one other node in the cluster.  
Doing that well probably means finding another dependency where somebody else 
has worked out the bugs.
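
To make the in-memory alternative concrete (again just a sketch, with a 
made-up Follower interface standing in for whatever replication mechanism 
would actually be used), the minimum requirement is that an enqueue isn't 
acknowledged until at least one other node holds a copy:

{code:java}
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical in-memory overseer queue that mirrors every entry to another
// node before acknowledging it, so the queue survives loss of the node that
// currently holds it.
public class ReplicatedQueue<T> {
  public interface Follower<T> {
    // Ship the entry to another node; return true only once it is held there.
    boolean mirror(T entry);
  }

  private final Queue<T> local = new ConcurrentLinkedQueue<>();
  private final Follower<T> follower;

  public ReplicatedQueue(Follower<T> follower) {
    this.follower = follower;
  }

  public boolean offer(T entry) {
    if (!follower.mirror(entry)) {
      return false; // refuse work we could lose
    }
    return local.offer(entry);
  }

  public T poll() {
    return local.poll();
  }
}
{code}

Handling the failure cases well (follower dies, leader dies mid-transfer, both 
restart at once) is exactly the part where a dependency that has already 
worked out the bugs would earn its keep.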


> Overseer can become the bottleneck in very large clusters
> ---------------------------------------------------------
>
>                 Key: SOLR-10265
>                 URL: https://issues.apache.org/jira/browse/SOLR-10265
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Varun Thacker
>
> Let's say we have a large cluster. Some numbers:
> - To ingest the data at the volume we want, I need roughly a 600-shard 
> collection.
> - Index into the collection for 1 hour and then create a new collection.
> - For a 30-day retention window, these numbers put us at ~400k cores in the 
> cluster (600 shards x 24 collections/day x 30 days = ~432k cores).
> - Just a rolling restart of this cluster can take hours because the overseer 
> queue gets backed up. If a few nodes lose connectivity to ZooKeeper, we can 
> also end up with lots of messages in the Overseer queue.
> With some tests, here are the two high-level problems we have identified:
> 1> How fast can the overseer process operations:
> The rate at which the overseer processes events is too slow at this scale. 
> I ran {{OverseerTest#testPerformance}}, which creates 10 collections (1 shard, 
> 1 replica) and generates 20k state change events. The test took 119 seconds 
> to run on my machine, which means ~170 events a second. Let's say a server can 
> process 5x what my machine can, so 1k events a second. 
> Total events generated by a 400k-replica cluster = 400k * 4 (state changes 
> until a replica becomes active) = 1.6M events; at 1k events a second that is 
> ~1,600 seconds (roughly 27 minutes), and in practice longer because of the 
> slowdown described next.
> The second observation was that the rate at which the overseer can process 
> events slows down as the number of items in the queue grows.
> I ran the same {{OverseerTest#testPerformance}} but changed the number of 
> events generated to 2,000 instead. The test took only 5 seconds to run, so 
> the per-event rate was much higher than in the run that generated 20k events.
> 2> State changes overwhelming ZK:
> For every state change Solr writes out a big state.json to ZooKeeper. This 
> can lead to the ZooKeeper transaction logs growing out of control even with 
> auto-purging etc. configured. 
> I haven't debugged why the transaction logs ran into terabytes without 
> snapshots being taken, but this was my assumption based on the other problems 
> we observed.


