Elasticsearch-HQ screenshot of node analysis
https://github.com/putztzu/Misc_images/blob/master/elasticsearch-hq_Why_ES4.png

Elasticsearch 1.0 RC1

5-Node Cluster information

ES-Marvel-openSUSE: 4GB RAM / 20GB ES storage
  (also runs web, logstash, redis and other apps, so it was given more RAM)
Elasticsearch-1: 1GB RAM / 20GB ES storage
Elasticsearch-2: 1GB RAM / 20GB ES storage
Elasticsearch-3: 1GB RAM / 20GB ES storage
Elasticsearch-4: 1GB RAM / 20GB ES storage

Data Description
Apache data, indexed by date

Data Content
Each index should hold more or less the same amount of data, so the
expectation is that, on average, the shards should therefore be more or
less the same size.
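
Since the data is loaded as date-based indices, a quick way to verify the
"roughly equal indices" assumption is the cat API, which is new in 1.0.
(The logstash-* index pattern below is an assumption on my part --
substitute whatever naming convention the cluster actually uses.)

  # per-index doc count and on-disk store size, one line per index
  curl -XGET 'http://localhost:9200/_cat/indices/logstash-*?v'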

"Normal" behavior observed
The same data has been inserted into this cluster 3 times (of course purge 
between each reload)
The first two times data was distributed across all the nodes more or less 
evenly.

Anomaly observed
On this third setup of the cluster, an anomaly was observed from the
outset: data usage grew unusually fast on the node the data was being
inserted into (ES-Marvel-openSUSE) and on one of the other nodes
(Elasticsearch-4).

Moreover, after two cluster shutdowns and recoveries the problem seems to
have been exacerbated: the unequal data distribution not only persisted,
but each recovery appears to have created additional permanent data
across all nodes.
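
To put numbers on the skew, these two cat-API calls give the per-node
picture (offered as a diagnostic sketch, not a fix):

  # shards per node plus disk used/available on each node
  curl -XGET 'http://localhost:9200/_cat/allocation?v'
  # every shard with its state, size and assigned node
  curl -XGET 'http://localhost:9200/_cat/shards?v'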

1. When the cluster is recovered and no additional raw data is inserted,
the increase in data storage suggests that additional ES data is being
created. That may make sense to some degree, since shard re-allocation
appears to take place regardless of whether it should have been
disabled, and it has been posted that it's cheaper to simply copy shards
than to run integrity checks and re-integrate them. I am running Marvel,
but according to es-head and es-hq the Marvel data is very small
compared to the major increases I'm seeing, and those shards aren't
being allocated to Elasticsearch-4. Does the increase in used disk
storage suggest that obsolete data is not being purged?
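
If allocation was supposed to be disabled across the restarts, it may be
worth confirming that the cluster actually had it disabled. A sketch
using the settings API (cluster.routing.allocation.enable is the 1.0
setting; 0.90 clusters used cluster.routing.allocation.disable_allocation
instead). Note that transient settings do not survive a full cluster
restart, which alone could explain allocation running even though it was
thought to be disabled:

  # see which transient/persistent settings are actually in effect
  curl -XGET 'http://localhost:9200/_cluster/settings?pretty'
  # before a full-cluster restart: stop all shard allocation
  curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
    "transient": { "cluster.routing.allocation.enable": "none" }
  }'
  # after every node has rejoined: allow allocation again
  curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
    "transient": { "cluster.routing.allocation.enable": "all" }
  }'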

2. What determines "balance" regarding shard allocation? Using es-head,
I can see that fewer shards are sometimes allocated to the node with
fast-shrinking disk space (Elasticsearch-4), but after a while
allocation appears to return to normal. Note that RAM and CPU capacity
is equal across the four Elasticsearch-N nodes.
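
As far as I understand the 1.0 allocator, "balance" is computed from
shard counts (per node and per index), not from shard byte sizes, which
is how a node can fill its disk while still looking balanced. There is a
disk-threshold decider, but to my knowledge it is not enabled by default
in this version; enabling it looks roughly like this (the watermark
values are illustrative, not recommendations):

  curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
    "persistent": {
      "cluster.routing.allocation.disk.threshold_enabled": true,
      "cluster.routing.allocation.disk.watermark.low": "85%",
      "cluster.routing.allocation.disk.watermark.high": "90%"
    }
  }'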

3. In this kind of situation, is there a recommended remedy? Since this
appears to be a "runaway" scenario that keeps feeding a node which will
shortly have no capacity left, I've been considering simply shutting
down the problem node, purging its data, re-joining it, and then hoping
the cluster will re-balance itself. Would that be a recommended
procedure, after verifying that all shards on the problem node have
replicas on other nodes? If the situation really is "runaway", I don't
consider simply adding storage to be a viable solution.
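
Rather than hard-stopping the node and purging it, allocation filtering
can drain it first, so replicas are rebuilt elsewhere while the node is
still up. A sketch (the IP below is a placeholder for Elasticsearch-4's
actual address; exclude._name works as well):

  # tell the cluster to move all shards off the given node
  curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
    "transient": { "cluster.routing.allocation.exclude._ip": "10.0.0.4" }
  }'
  # watch the shards migrate away
  curl -XGET 'http://localhost:9200/_cat/shards?v'
  # once the node is empty, clear the exclusion so it can take shards again
  curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
    "transient": { "cluster.routing.allocation.exclude._ip": "" }
  }'
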
4. The host machine these virtual machines run on indicates massive disk
activity, but I'm uncertain what to attribute it to. According to es-hq,
two indices seem to be in the process of initializing, but according to
es-head all shards have been allocated and are "green." Since no new
data is being inserted and all existing shards should be healthy, I
don't know why there should be any index initialization activity.
Update: after watching es-hq for a while, I'm noticing that after shard
initialization there is a re-allocation, which might be related. But
there is no easy visibility into which shard this is, which node it is
on, and whether it really is being re-allocated.
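
For visibility into exactly which shard is initializing or relocating,
and between which nodes, the cat recovery endpoint gives a per-shard
answer that es-hq and es-head don't surface:

  # one line per shard recovery: index, shard, type, stage, and the
  # source and target nodes
  curl -XGET 'http://localhost:9200/_cat/recovery?v'
  # quick counts of initializing/relocating/unassigned shards
  curl -XGET 'http://localhost:9200/_cluster/health?pretty'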
 
5. Is there a ready tool to display (or return) specifically the ES
overhead data I suspect is being stored on the nodes? So far I've only
found overall data usage or free space. If nothing is available, I
suspect a workaround could be to query for the shard data sizes and then
subtract them from the overall storage used. If such a tool exists,
perhaps even breaking down how the space is being used, then maybe I can
start to understand exactly what is running differently in this cluster.
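
There is no single "overhead" number that I know of, but the subtraction
workaround described above can be done with two stock APIs: the indices
stats report the store size ES itself accounts for, while the node stats
report what the filesystem says is used. The gap between them is the
candidate overhead (a sketch of the idea, not an exact accounting --
translog, in-flight merges and Marvel's own indices all live in that
gap):

  # store size per index (and totals), as accounted for by ES
  curl -XGET 'http://localhost:9200/_stats?pretty'
  # filesystem used/free as each node sees it
  curl -XGET 'http://localhost:9200/_nodes/stats/fs?pretty'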
 

I speculate that something may not have been set up properly in this
cluster from the beginning, but I'm uncertain how to analyze exactly
what the problem is. I've posted the elasticsearch-hq screenshot at the
top of this post for reference, but if anyone can suggest a command to
extract further useful information, I'm open to it.
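
For further poking around, the cat API also lists its own endpoints,
which makes it easy to discover what else can be pulled out of the
cluster:

  curl -XGET 'http://localhost:9200/_cat'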

Thankfully this cluster is a lab, so I'm treating this as a learning
experience, but if this occurred in a larger production cluster I
imagine it would be setting off alarm bells.

Thx
Tony 
