[ 
https://issues.apache.org/jira/browse/SOLR-2765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123800#comment-13123800
 ] 

Mark Miller commented on SOLR-2765:
-----------------------------------

The current method of dealing with downed nodes is not so bad: the cluster 
layout is compared with live_nodes, which lets searchers know a node is down 
within the ephemeral node timeout. Before that happens (a brief window), 
failed requests are simply retried on another replica. The searcher locally 
marks the server as bad and then periodically tries it again, unless its 
ephemeral node goes away, in which case it is no longer consulted.
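
As a rough illustration of that flow (not the actual Solr code), here is a 
minimal Java sketch. The class name ReplicaSelector, its methods, and the 
retry interval are all hypothetical, and it assumes a replica can be 
identified by the same string that appears in live_nodes. Time is passed in 
explicitly to keep the behavior easy to follow.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for illustration only; not a Solr class. */
class ReplicaSelector {
    private final long retryIntervalMs;
    // Replica name -> time (ms) at which it was locally marked bad.
    private final Map<String, Long> markedBadAt = new HashMap<>();

    ReplicaSelector(long retryIntervalMs) {
        this.retryIntervalMs = retryIntervalMs;
    }

    /** Record a failed request so the replica is skipped for a while. */
    void markBad(String replica, long nowMs) {
        markedBadAt.put(replica, nowMs);
    }

    /** Record a successful request and clear any local "bad" mark. */
    void markGood(String replica) {
        markedBadAt.remove(replica);
    }

    /**
     * Returns the replicas worth trying, in layout order: those that are in
     * the cluster layout, present in liveNodes, and either not marked bad or
     * marked bad long enough ago that a retry is due.
     */
    List<String> candidates(List<String> layoutReplicas,
                            Set<String> liveNodes,
                            long nowMs) {
        List<String> result = new ArrayList<>();
        for (String replica : layoutReplicas) {
            if (!liveNodes.contains(replica)) {
                // Ephemeral node is gone: stop consulting this replica and
                // forget the local mark, since liveness now covers it.
                markedBadAt.remove(replica);
                continue;
            }
            Long badSince = markedBadAt.get(replica);
            if (badSince == null || nowMs - badSince >= retryIntervalMs) {
                result.add(replica);
            }
        }
        return result;
    }
}
```

A caller would walk the returned list in order, call markBad on a failed 
request and move on to the next replica, and call markGood once a 
previously bad server answers again.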

bq. The client cannot derive this information accurately from simple liveness 
information.

It's simply not supported that way currently - this is intentional though. If 
you want to change which shards a node is responsible for serving, you don't 
just bring it back up with fewer or different shards - you first delete the 
node info from the cluster layout, then you bring it up. At the time, we 
didn't mind that a variety of advanced scenarios required manual editing of 
the zk layout. We have intended to move towards a separate model and state 
layout eventually though (see the solrcloud wiki page). I think that is 
essentially the proposed path.

I admit I'm biased against an overseer, almost more than against optimistic 
collection locks, but I have not had time to fully digest the latest proposed 
changes. I suppose that when you have a solid leader election process 
available, an overseer is fairly cheap, and if used for the right things, 
fairly simple. When we get into rebalancing (we don't plan to right away), I 
suppose we come back to it anyhow.

bq. marking replicas as defunct might do, 

Yeah, I think this gets complicated to do well in general. I like simple 
solutions like the one above. And I think good monitoring is a perfectly 
acceptable requirement for a very large cluster.


It's good stuff to consider. Exploring all of these changes should likely be 
spun off into another issue though. Advancements in how we handle all of this 
are a much larger issue than Shard/Node states.
                
> Shard/Node states
> -----------------
>
>                 Key: SOLR-2765
>                 URL: https://issues.apache.org/jira/browse/SOLR-2765
>             Project: Solr
>          Issue Type: Sub-task
>          Components: SolrCloud, update
>            Reporter: Yonik Seeley
>             Fix For: 4.0
>
>         Attachments: combined.patch, incremental_update.patch, 
> scheduled_executors.patch, shard-roles.patch
>
>
> Need state for shards that indicate they are recovering, active/enabled, or 
> disabled.

