Hi Mailing List! I'm a first-time poster and a long-time reader.
We recently had a crash in our ES (1.3.1 on Ubuntu) cluster which caused us
to lose a significant volume of data. I have a theory about what happened
to cause this, and I would love to hear your opinions on it, and whether
you have any suggestions to mitigate it.
Here is a simplified play-by-play:
1. The cluster has 3 data nodes: A, B, and C. The index has 10 shards and
a replica count of 1, so (simplifying) A holds the primary copy and B holds
the replica. C is doing nothing. Re-allocation of indexes/shards is enabled.
2. A crashes. B's copy is promoted to primary, and B then starts
transferring data to C as a new replica.
3. B crashes before the transfer finishes. C now holds the primary, with
only a partial dataset.
4. There is a write to the index.
5. A and B finally reboot, and they are told that they are now stale (as
C had a write while they were away). Both A and B delete their local data.
A is chosen to be the new replica and re-syncs from C.
6. ... all the data A and B had which C never got is lost forever.
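
To make the sequence concrete, the steps above can be sketched as a toy
Python model. This only illustrates my theory and is not how ES tracks
shard copies internally: each node's copy is reduced to a set of document
IDs, and the function name and the numbers (100 documents, 40 copied to C
before B crashed) are made up for the example.

```python
# Toy model of the play-by-play above. NOT real Elasticsearch behaviour:
# each node's copy of the index is just a set of document IDs.

def replay_crash_sequence(total_docs=100, copied_before_b_crashed=40):
    docs = set(range(total_docs))

    # Step 1: A and B each hold a full copy, C holds nothing.
    copies = {"A": set(docs), "B": set(docs), "C": set()}

    # Step 2: A crashes; B takes over and starts copying to C.
    # Step 3: B crashes part-way through, so C has only some documents.
    copies["C"] = set(range(copied_before_b_crashed))

    # Step 4: a write arrives and is accepted by C, the only live copy.
    new_doc = total_docs
    copies["C"].add(new_doc)

    # Step 5: A and B return, are treated as stale, and discard their data.
    lost = (copies["A"] | copies["B"]) - copies["C"]
    copies["A"] = set()
    copies["B"] = set()

    # A is chosen as the new replica and re-syncs from C.
    copies["A"] = set(copies["C"])

    # Step 6: whatever only A and B held is gone.
    return copies, lost


copies, lost = replay_crash_sequence()
print(f"documents lost: {len(lost)}")
print(f"documents left on C: {len(copies['C'])}")
```

With these made-up numbers, 60 of the original 100 documents are lost, and
C ends up with 41 (the 40 it received plus the one new write).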
Is the above scenario possible? If it is, would it be a better default for
ES not to reallocate in this situation?
This would have caused the write in step #4 to fail, but in our use case,
that is preferable to data loss.
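
The trade-off I have in mind can be sketched the same way. This is a
hypothetical policy switch in a toy model, not an existing ES setting; the
function name, the flag, and the document counts are assumptions for
illustration only.

```python
# Toy comparison of the two behaviours. NOT an Elasticsearch API.

def surviving_docs(refuse_write_on_incomplete_copy):
    original = set(range(100))   # what A and B both held before the crashes
    on_c = set(range(40))        # what C had received when B crashed
    new_doc = 100                # the write from step 4

    if refuse_write_on_incomplete_copy:
        # The write fails, so C never becomes newer than A and B.
        # When A and B return, their complete copies can still be used.
        return {"write_accepted": False, "docs": set(original)}

    # Behaviour described in the play-by-play: C accepts the write, A and B
    # are judged stale on return, and only C's data survives.
    on_c.add(new_doc)
    return {"write_accepted": True, "docs": on_c}


for refuse in (False, True):
    result = surviving_docs(refuse)
    print(refuse, result["write_accepted"], len(result["docs"]))
```

In this sketch, refusing the write keeps all 100 original documents at the
cost of one failed write, while accepting it leaves only 41.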