[ https://issues.apache.org/jira/browse/SOLR-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15931544#comment-15931544 ]

Erick Erickson commented on SOLR-10265:
---------------------------------------

bq: My suspicion is that it's problematic because of the number of messages in 
a typical queue

Well, that's certainly true of the Overseer work queue in this case, for at 
least two reasons: 1> the current method of dealing with the queue doesn't 
batch, and 2> the state changes get written to the logs, which get _huge_. 
Part of what we're exploring is being smarter about that too.
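
To make the batching point concrete, here is a minimal sketch of draining a 
ZK sequential-node queue in one pass and writing the merged state once, 
assuming a plain getChildren/getData/delete loop; the actual Overseer queue 
code (DistributedQueue) differs, and the merge/flush helpers are placeholders:

{code:java}
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class BatchedQueueDrain {
  // Drain up to batchSize queued state-change messages in one pass instead
  // of paying a full read/write round trip per message.
  public static void drainBatch(ZooKeeper zk, String queuePath, int batchSize)
      throws KeeperException, InterruptedException {
    // One getChildren call fetches every pending message name at once.
    List<String> children = zk.getChildren(queuePath, false);
    Collections.sort(children); // sequential znodes sort into arrival order

    int n = Math.min(batchSize, children.size());
    for (String child : children.subList(0, n)) {
      String path = queuePath + "/" + child;
      byte[] data = zk.getData(path, false, null);
      mergeStateChange(data); // fold the message into one pending state update
      zk.delete(path, -1);    // ack by removing the message znode
    }
    flushMergedState(); // a single state write instead of one per message
  }

  private static void mergeStateChange(byte[] message) { /* placeholder */ }
  private static void flushMergedState() { /* placeholder */ }
}
{code}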

bq: I can envision types that indicate "node X just came up" and "node X just 
went down"

"Just went down" is being considered, but "just came up" is a different 
problem. Consider a single JVM hosting 400 replicas from 100 different 
collections each of which has 4 shards. "Node just came up" could (I think) 
move them all from "down" to "recovering" in one go, but all the leader 
election stuff still needs to be hashed out. Also consider if a leader and 
follower are on the same node. How do they coordinate the leader coming up and 
informing the follower of that fact without sending individual messages?
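
To illustrate the kind of aggregation being discussed, a single hypothetical 
"node came up" message could list every core hosted on the node so the 
Overseer flips them all to "recovering" in one operation (the operation name, 
field names, and class below are invented for the sketch, not real Solr APIs):

{code:java}
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NodeUpMessage {
  // One aggregate message instead of 400 per-replica state messages.
  public static Map<String, Object> build(String nodeName, List<String> cores) {
    Map<String, Object> msg = new LinkedHashMap<>();
    msg.put("operation", "nodeup"); // assumed op name, not an existing one
    msg.put("node_name", nodeName);
    msg.put("cores", cores);        // all replicas hosted on the node
    msg.put("state", "recovering"); // down -> recovering in one go
    return msg;
  }
}
{code}

Leader election and leader/follower coordination would still have to happen 
per shard, which is exactly the part a message like this doesn't cover.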

bq: When a collection is created, is there REALLY a need to send state changes 
for EVERY collection in the cloud? 

IIUC this just shouldn't be happening. If it is, we need to fix it. I've 
certainly seen some allegations that this is so.

bq: If ZK isn't used for the overseer queue, ....

Well, that's part of the proposal. The in-memory Overseer queue is OK 
(according to Noble's latest comment) if each client retries submitting the 
pending operations that aren't ack'd. So no alternate queueing mechanism is 
necessary.
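
A minimal sketch of that retry contract, assuming a hypothetical 
OverseerClient interface (none of these names are real Solr APIs): the client 
keeps each pending operation until it is ack'd and resends on timeout, which 
is what makes a volatile in-memory queue safe to lose:

{code:java}
import java.util.concurrent.TimeUnit;

public class ResubmitLoop {
  // Hypothetical client-side view of the in-memory Overseer queue.
  interface OverseerClient {
    // Returns true once the Overseer has acknowledged the operation.
    boolean submitAndAwaitAck(byte[] op, long timeout, TimeUnit unit)
        throws InterruptedException;
  }

  static void submitWithRetry(OverseerClient client, byte[] op)
      throws InterruptedException {
    long backoffMs = 100;
    while (!client.submitAndAwaitAck(op, 5, TimeUnit.SECONDS)) {
      // Not ack'd: the Overseer may have restarted and lost its in-memory
      // queue, so resend the same (idempotent) operation.
      Thread.sleep(backoffMs);
      backoffMs = Math.min(backoffMs * 2, 5_000);
    }
  }
}
{code}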

What's really happening here IMO is the evolution of SolrCloud. So far we've 
gotten away with inefficiencies because we haven't really had huge clusters to 
deal with; now we're getting them. So we need to squeeze what efficiencies we 
can out of the current architecture while considering whether we really need 
ZK as the intermediary. Of course cutting down the number of messages would 
help; the fear is that a great deal of work has been put into hardening all 
the wonky conditions, so I do worry a bit about wholesale restructuring.... 
Ya gotta bite the bullet sometime though.


> Overseer can become the bottleneck in very large clusters
> ---------------------------------------------------------
>
>                 Key: SOLR-10265
>                 URL: https://issues.apache.org/jira/browse/SOLR-10265
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public (Default Security Level. Issues are Public)
>            Reporter: Varun Thacker
>
> Let's say we have a large cluster. Some numbers:
> - To ingest the data at the volume we want, I need roughly a 600-shard 
> collection.
> - Index into the collection for 1 hour and then create a new collection.
> - For a 30-day retention window with these numbers we would end up with 
> ~400k cores in the cluster.
> - Just a rolling restart of this cluster can take hours because the overseer 
> queue gets backed up. If a few nodes lose connectivity to ZooKeeper, we can 
> also end up with lots of messages in the Overseer queue.
> With some tests, here are the two high-level problems we have identified:
> 1> How fast can the overseer process operations:
> The rate at which the overseer processes events is too slow at this scale. 
> I ran {{OverseerTest#testPerformance}}, which creates 10 collections (1 shard, 
> 1 replica) and generates 20k state change events. The test took 119 seconds 
> to run on my machine, which means ~170 events a second. Let's say a server can 
> process 5x what my machine can, so 1k events a second. 
> Total events generated by a 400k replica cluster = 400k * 4 (state changes 
> until a replica becomes active) = 1.6M; at 1k events a second that is 1,600 
> seconds, or ~27 minutes (a worked check of these numbers follows the 
> description below).
> The second observation was that the rate at which the overseer can process 
> events slows down as the number of items in the queue grows. 
> I ran the same {{OverseerTest#testPerformance}} but changed the number of 
> events generated to 2000 instead. The test took only 5 seconds to run, so it 
> was a lot faster than the run that generated 20k events.
> 2> State changes overwhelming ZK:
> For every state change Solr writes out a big state.json to ZooKeeper. 
> This can lead to the ZooKeeper transaction logs growing out of control even 
> with auto purging etc. set. 
> I haven't debugged why the transaction logs ran into terabytes despite 
> snapshots being taken, but this was my assumption based on the other problems 
> we observed.
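
For reference, a back-of-the-envelope check of the numbers quoted above; the 
rates (4 state changes per replica, 1k events processed a second) are the 
reporter's assumptions:

{code:java}
public class ScaleMath {
  public static void main(String[] args) {
    long shards = 600;             // shards per hourly collection
    long cores = shards * 24 * 30; // 30-day retention -> 432,000 (~400k)
    long events = 400_000L * 4;    // state changes until replicas are active
    long seconds = events / 1_000; // at 1k events processed per second
    System.out.printf("cores=%d events=%d drain=%ds (~%.0f min)%n",
        cores, events, seconds, seconds / 60.0);
    // prints: cores=432000 events=1600000 drain=1600s (~27 min)
  }
}
{code}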


