[
https://issues.apache.org/jira/browse/SOLR-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123702#comment-13123702
]
Mark Miller commented on SOLR-2358:
-----------------------------------
Initially, a request will be fully synchronous and will not return success to
the client until the request has been sent to each replica. So if a leader goes
down before all replicas receive and ACK the request, the client will not get
an ACK. A new leader will be elected. When the downed previous leader comes
back, it will come up in recovery mode. I expect recovery to be a difficult
part, and we have not fully worked it out yet. To recover, the node will have
to talk to the leader and figure out what it has that it should not, what it
doesn't have, etc. Then the recovering node either receives replays or
replaces its entire index. Lots of details to work out here.
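
To make the recovery comparison concrete, here is a minimal sketch, not the
actual design (which is still open). It assumes each node can report a map of
doc id to version; the function name, the map shape, and the max_replay
threshold are all hypothetical.

```python
def plan_recovery(leader_versions, local_versions, max_replay=100):
    """Compare {doc_id: version} maps and decide how a recovering node catches up.

    All names and the max_replay threshold are illustrative assumptions.
    """
    # Docs the recovering node has that the leader does not: these must go.
    extra = sorted(set(local_versions) - set(leader_versions))
    # Docs the leader has that are absent locally or at a different version.
    missing = sorted(
        doc_id
        for doc_id, version in leader_versions.items()
        if local_versions.get(doc_id) != version
    )
    # Small differences can be replayed; large ones fall back to a full copy.
    if len(extra) + len(missing) <= max_replay:
        strategy = "replay"
    else:
        strategy = "replace_index"
    return {"strategy": strategy, "delete": extra, "fetch": missing}


leader = {"a": 3, "b": 5, "c": 1}
local = {"a": 3, "b": 4, "d": 2}
plan = plan_recovery(leader, local)
```

With these inputs the recovering node would drop "d", fetch "b" and "c", and
replay rather than replace the index.
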
You have an interesting problem in that some leader candidates among the
replicas may have an update while others don't, as the leader may have died in
the middle of relaying requests. We might prefer a new leader with the
greatest versioned doc? Most client retries in this case will be fine
(globally unique IDs are required, so no worry about dupes). Then replicas talk
to the leader and sync up. Or when a new leader is elected, replicas just talk
amongst each other and sync up, or…
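
As a rough illustration of the "greatest versioned doc" idea only: assuming
each candidate can report the highest doc version it has seen, picking a leader
could look like the sketch below. The function name and the tie-break by node
name are assumptions, not anything we have decided.

```python
def pick_leader(candidates):
    """Pick the candidate that has seen the highest doc version.

    candidates: {node_name: highest_doc_version} (hypothetical shape).
    Ties are broken by node name so every node reaches the same answer.
    """
    if not candidates:
        raise ValueError("no leader candidates")
    # max() keeps the first maximal item, so iterating over sorted names
    # makes the tie-break deterministic.
    return max(sorted(candidates), key=lambda name: candidates[name])


new_leader = pick_leader({"node1": 41, "node2": 42, "node3": 42})
```

The remaining replicas would then sync against whichever node is chosen.
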
If the leader fails right before sending an ACK, the client will likely repeat
the request. In the case of doc adds/updates with the same id, the retry will
just replace the doc written by the earlier successful attempt, or the client
will be able to use optimistic locking to figure out that either its update or
someone else's actually went through already. The client would already know
that its update may have gone through, because the connection would have timed
out rather than returned a failure.
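
A small sketch of why such a retry is safe, using a toy in-memory index rather
than Solr itself. The class, the integer version counter, and the
expected_version parameter are all hypothetical stand-ins for whatever the
optimistic locking ends up looking like.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version does not match the stored one."""


class TinyIndex:
    """Toy stand-in for an index keyed by unique doc id (illustration only)."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, body)

    def version_of(self, doc_id):
        # Version 0 means the doc does not exist yet.
        return self._docs.get(doc_id, (0, None))[0]

    def add(self, doc_id, body, expected_version=None):
        current = self.version_of(doc_id)
        # Optimistic locking: refuse the write if someone got there first.
        if expected_version is not None and expected_version != current:
            raise VersionConflict(f"{doc_id}: expected {expected_version}, found {current}")
        # Same id overwrites the previous doc, so a retry creates no duplicate.
        self._docs[doc_id] = (current + 1, body)
        return current + 1


index = TinyIndex()
index.add("doc1", "hello")       # first attempt succeeds, but the ACK is lost
index.add("doc1", "hello")       # blind retry simply replaces the same doc
try:
    index.add("doc1", "hello", expected_version=0)  # conditional retry
    outcome = "written"
except VersionConflict:
    outcome = "already applied"  # an update went through before this retry
```

The blind retry leaves a single copy of the doc, while the conditional retry
tells the client that an update had already landed.
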
Eventually, we might consider a mode where the request is ACK'd before it's on
all replicas, in which case you might accept a higher risk of data loss.
bq. indexes diverge because some replicas commit a change while others do not
It's an area we have not fully worked out (though Yonik has likely thought
about a lot of this more than I have so far). Initially though, I think Yonik's
point was that you can usually expect success on all nodes unless the issue is
something that would require the node to come down and then come back in
recovery mode. We certainly want to be resilient here eventually though. As we
work through recovery scenarios, I think this will become more clear.
Long story short, we have been discussing and thinking about these various
scenarios, but largely we are also taking things one issue at a time.
> Distributing Indexing
> ---------------------
>
> Key: SOLR-2358
> URL: https://issues.apache.org/jira/browse/SOLR-2358
> Project: Solr
> Issue Type: New Feature
> Components: SolrCloud, update
> Reporter: William Mayor
> Priority: Minor
> Fix For: 4.0
>
> Attachments: SOLR-2358.patch
>
>
> The first steps towards creating distributed indexing functionality in Solr