On Fri, Jun 30, 2017 at 7:13 PM, Guillaume Lederrey
<[email protected]> wrote:
> Hello!
>
> We've had a significant slowdown of elasticsearch today (see Grafana
> for exact timing [1]). The impact was low enough that it probably does
> not require a full incident report (the number of errors did not rise
> significantly [2]), but understanding what happened and sharing that
> understanding is important. This is going to be a long and technical
> email; you might get bored, so feel free to close it and delete it
> right now.
>
> TL;DR: elastic1019 was overloaded because it hosted too many heavy
> shards; banning all shards from elastic1019 to reshuffle them allowed
> it to recover.
>
> In more detail:
>
> elastic1019 was hosting shards for commonswiki, enwiki and frwiki,
> which are all high-load shards. elastic1019 is one of our older
> servers, which are less powerful, and it might also suffer from CPU
> overheating [3].
>
> The obvious question: "why do we even allow multiple heavy shards to
> be allocated on the same node?". The answer is obvious as well: "it's
> complicated...".
>
> One of the very interesting features of elasticsearch is its ability
> to automatically balance shards. This allows the cluster to
> automatically rebalance when nodes are lost, and to spread resource
> usage across all nodes in the cluster [4]. Constraints can be added to
> account for available disk space [5], rack awareness [6], or even to
> apply specific filtering for specific indices [7]. It does not,
> however, directly allow constraining allocation based on the load of a
> specific shard.
>
> We do have a few mechanisms to ensure that load is as uniform as
> possible on the cluster:
>
> An index is split into multiple shards, and each shard is replicated
> multiple times to provide redundancy and to spread load. Both are
> configured per index.
>
> We know which are the heavy indices (commons, enwiki, frwiki, ...),
> both in terms of size and in terms of traffic.
> Those indices are split into a number of shards + replicas close to
> the number of nodes in the cluster, to ensure that those shards are
> spread evenly across the cluster, with only a few shards of the same
> index on the same node, while still allowing us to lose a few nodes
> and keep all shards allocated. For example, enwiki_content has 8
> shards, with 2 replicas each, so a total of 24 shards, with a maximum
> of 2 shards on the same node. This approach works well most of the
> time.
Did you wonder how 8 shards * 2 replicas makes 24 total shards? Well,
it is actually 8 shards * (1 primary + 2 replicas). And now it makes
sense!

> The limitation is that a shard is a "scalability unit": you can't
> move around anything smaller than a shard. In the case of enwiki, a
> single shard is ~40GB and serves a fairly large number of requests
> per second. If a node has just one more of those shards, that's
> already a significant amount of additional load.
>
> The solution could be to split large indices into many more shards;
> the scalability unit would be much smaller, and it would be much
> easier to achieve a uniform load. Of course, there are also
> limitations. The total number of shards in the cluster has a
> significant cost: increasing it will add load to cluster operations
> (which are already quite expensive with the total number of shards we
> have at this point). There are also functional issues: ranking (BM25)
> uses statistics calculated per shard, and with smaller shards, at
> some point the stats might not be representative of the whole corpus.
>
> There are probably a lot more details we could get into; feel free to
> ask more questions and we can continue the conversation. And I'm sure
> David and Erik have a lot to add!
>
> Thanks for reading to the end!
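To make the shard arithmetic above concrete, here is a quick sketch in plain Python. The numbers are taken from the enwiki_content example; the per-node cap of 2 is the observed maximum mentioned in the email, not necessarily an explicit cluster setting:

```python
import math

# Numbers from the enwiki_content example above.
primaries = 8   # primary shards
replicas = 2    # replicas per primary

# Each primary counts as one copy, plus its replicas.
total_copies = primaries * (1 + replicas)
print(total_copies)  # 24 shard copies to allocate

# With at most 2 copies of this index on any one node, spreading all
# copies requires at least this many nodes:
min_nodes = math.ceil(total_copies / 2)
print(min_nodes)  # 12
```

With 24 copies capped at 2 per node, the index needs at least 12 nodes, which is why the shard + replica counts are chosen to be close to the cluster size: the copies spread out evenly, yet a few nodes can be lost while keeping every shard allocated.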
>
> Guillaume
>
> [1] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&panelId=20&fullscreen&from=1498827816290&to=1498835616852
> [2] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&panelId=9&fullscreen&from=1498827816290&to=1498835616852
> [3] https://phabricator.wikimedia.org/T168816
> [4] https://www.elastic.co/guide/en/elasticsearch/reference/current/shards-allocation.html
> [5] https://www.elastic.co/guide/en/elasticsearch/reference/current/disk-allocator.html
> [6] https://www.elastic.co/guide/en/elasticsearch/reference/current/allocation-awareness.html
> [7] https://www.elastic.co/guide/en/elasticsearch/reference/current/shard-allocation-filtering.html
>
> --
> Guillaume Lederrey
> Operations Engineer, Discovery
> Wikimedia Foundation
> UTC+2 / CEST

--
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation
UTC+2 / CEST

_______________________________________________
discovery mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/discovery
