Hi Mailing List!  I'm a first-time poster and a long-time reader.

We recently had a crash in our ES (1.3.1 on Ubuntu) cluster which caused us 
to lose a significant volume of data.  I have a theory about what caused 
this, and I would love to hear your opinions on it and any suggestions you 
have for mitigating it.

Here is a simplified play-by-play:


   1. The cluster has 3 data nodes: A, B, and C.  The index has 10 shards 
   and a replica count of 1, so (simplifying) A holds the primary copy and B 
   holds the replica.  C is doing nothing.  Re-allocation of indexes/shards 
   is enabled.  
   2. A crashes.  B's copy is promoted to primary, and B starts transferring 
   data to C as the new replica. 
   3. B crashes before the transfer finishes.  C now holds the primary, with 
   an incomplete dataset. 
   4. There is a write to the index.
   5. A and B finally reboot, and they are told that their copies are now 
   stale (as C accepted a write while they were away).  Both A and B delete 
   their local data.  A is chosen to be the new replica and re-syncs from C.  
   6. All the data that A and B had, and that C never received, is lost 
   forever.
   

Is the above scenario possible?  If it is, it seems like the better default 
behavior for ES would be to not reallocate in this situation.  That would 
have caused the write in step #4 to fail, but in our use case that is 
preferable to data loss. 
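
To make the theory easier to discuss, here is a toy Python model of the 
play-by-play above.  It is only a sketch of my hypothesis: the function 
simulate(), the flag refuse_writes_on_partial_copy, and the document IDs are 
all made up for illustration, and nothing here is confirmed to match how ES 
1.3.1 actually tracks or recovers shard copies.

```python
# Toy model of the hypothesized failure; NOT Elasticsearch's real logic.
# All names and document IDs below are invented for illustration.

def simulate(refuse_writes_on_partial_copy):
    full = {"doc1", "doc2", "doc3", "doc4"}

    # Step 1: A holds the primary copy, B the replica, C nothing.
    a = set(full)
    b = set(full)
    c = set()

    # Step 2: A crashes; B starts copying to C but only gets halfway.
    c |= {"doc1", "doc2"}

    # Step 3: B crashes; C is the only live copy, and it is incomplete.
    c_is_complete = (c == full)

    # Step 4: a write arrives.
    if refuse_writes_on_partial_copy and not c_is_complete:
        write_accepted = False
    else:
        c.add("doc5")
        write_accepted = True

    # Step 5: A and B come back.
    if write_accepted:
        # C has the newest write, so A and B are treated as stale:
        # both drop their local data and A re-syncs from C.
        a = set(c)
        b = set()
    else:
        # C never diverged, so it can finish syncing from a complete copy.
        c = set(a)

    # Step 6: which of the original documents no longer exist anywhere?
    lost = full - (a | b | c)
    return write_accepted, sorted(lost)


if __name__ == "__main__":
    print(simulate(refuse_writes_on_partial_copy=False))
    print(simulate(refuse_writes_on_partial_copy=True))
```

In this model, accepting the write in step 4 loses the documents C never 
received, while refusing it makes that write fail but keeps every original 
document.  That is the trade-off I am asking about.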
