Elasticsearch-HQ screenshot of node analysis
https://github.com/putztzu/Misc_images/blob/master/elasticsearch-hq_Why_ES4.png

Elasticsearch 1.0 RC1

5-Node Cluster information

ES-Marvel-openSUSE: 4GB RAM / 20GB ES storage
  (also runs web, logstash, redis and other apps, so it was given more RAM)
Elasticsearch-1: 1GB RAM / 20GB ES storage
Elasticsearch-2: 1GB RAM / 20GB ES storage
Elasticsearch-3: 1GB RAM / 20GB ES storage
Elasticsearch-4: 1GB RAM / 20GB ES storage

Data Description
Apache data, indexed by date

Data Content
Each index should hold more or less the same amount of data, so the
expectation is that, on average, the shards should therefore be more or
less the same size.
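
Since the data is loaded as date-based indices, a quick way to verify the
"roughly equal indices" assumption is the cat API, which is new in 1.0.
(The logstash-* index pattern below is an assumption on my part --
substitute whatever naming convention the cluster actually uses.)

  # per-index doc count and on-disk store size, one line per index
  curl -XGET 'http://localhost:9200/_cat/indices/logstash-*?v'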

"Normal" behavior observed
The same data has been inserted into this cluster 3 times (of course purge 
between each reload)
The first two times data was distributed across all the nodes more or less 
evenly.

Anomaly observed
On this third setup of the cluster, an anomaly was observed from the
outset: data usage grew unusually fast on the node the data was being
inserted into (ES-Marvel-openSUSE) and on one of the other nodes
(Elasticsearch-4).

Moreover, after two cluster shutdowns and recoveries the problem seems to
have been exacerbated: the unequal data distribution not only persisted,
but each recovery appears to have created additional permanent data
across all nodes.
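
To put numbers on the skew, these two cat-API calls give the per-node
picture (offered as a diagnostic sketch, not a fix):

  # shards per node plus disk used/available on each node
  curl -XGET 'http://localhost:9200/_cat/allocation?v'
  # every shard with its state, size and assigned node
  curl -XGET 'http://localhost:9200/_cat/shards?v'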

1. When the cluster is recovered and no additional raw data is inserted,
the increase in data storage suggests that additional ES data is being
created. That may make sense to some degree, since shard re-allocation
appears to take place regardless of whether it should have been
disabled, and it has been posted that it's cheaper to simply copy shards
than to run integrity checks and re-integrate them. I am running Marvel,
but according to es-head and es-hq the Marvel data is very small
compared to the major increases I'm seeing, and those shards aren't
being allocated to Elasticsearch-4. Does the increase in used disk
storage suggest that obsolete data is not being purged?
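
If allocation was supposed to be disabled across the restarts, it may be
worth confirming that the cluster actually had it disabled. A sketch
using the settings API (cluster.routing.allocation.enable is the 1.0
setting; 0.90 clusters used cluster.routing.allocation.disable_allocation
instead). Note that transient settings do not survive a full cluster
restart, which alone could explain allocation running even though it was
thought to be disabled:

  # see which transient/persistent settings are actually in effect
  curl -XGET 'http://localhost:9200/_cluster/settings?pretty'
  # before a full-cluster restart: stop all shard allocation
  curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
    "transient": { "cluster.routing.allocation.enable": "none" }
  }'
  # after every node has rejoined: allow allocation again
  curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
    "transient": { "cluster.routing.allocation.enable": "all" }
  }'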

2. What determines "balance" regarding shard allocation? Using es-head,
I can see that fewer shards are sometimes allocated to the node with
fast-shrinking disk space (Elasticsearch-4), but after a while
allocation appears to return to normal. Note that RAM and CPU capacity
is equal across the four Elasticsearch-N nodes.
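
As far as I understand the 1.0 allocator, "balance" is computed from
shard counts (per node and per index), not from shard byte sizes, which
is how a node can fill its disk while still looking balanced. There is a
disk-threshold decider, but to my knowledge it is not enabled by default
in this version; enabling it looks roughly like this (the watermark
values are illustrative, not recommendations):

  curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
    "persistent": {
      "cluster.routing.allocation.disk.threshold_enabled": true,
      "cluster.routing.allocation.disk.watermark.low": "85%",
      "cluster.routing.allocation.disk.watermark.high": "90%"
    }
  }'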

3. In this kind of situation, is there a recommended remedy? Since this
appears to be a "runaway" scenario that keeps feeding a node which will
shortly have no capacity left, I've been considering simply shutting
down the problem node, purging its data, re-joining it, and then hoping
the cluster will re-balance itself. Would that be a recommended
procedure, after verifying that all shards on the problem node have
replicas on other nodes? If the situation really is "runaway", I don't
consider simply adding storage to be a viable solution.
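
Rather than hard-stopping the node and purging it, allocation filtering
can drain it first, so replicas are rebuilt elsewhere while the node is
still up. A sketch (the IP below is a placeholder for Elasticsearch-4's
actual address; exclude._name works as well):

  # tell the cluster to move all shards off the given node
  curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
    "transient": { "cluster.routing.allocation.exclude._ip": "10.0.0.4" }
  }'
  # watch the shards migrate away
  curl -XGET 'http://localhost:9200/_cat/shards?v'
  # once the node is empty, clear the exclusion so it can take shards again
  curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
    "transient": { "cluster.routing.allocation.exclude._ip": "" }
  }'
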
4. The host machine these virtual machines run on indicates massive disk
activity, but I'm uncertain what to attribute it to. According to es-hq,
two indices seem to be in the process of initializing, but according to
es-head all shards have been allocated and are "green." Since no new
data is being inserted and all existing shards should be healthy, I
don't know why there should be any index initialization activity.
Update: after watching es-hq for a while, I'm noticing that after shard
initialization there is a re-allocation, which might be related. But
there is no easy visibility into which shard this is, which node it is
on, and whether it really is being re-allocated.
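
For visibility into exactly which shard is initializing or relocating,
and between which nodes, the cat recovery endpoint gives a per-shard
answer that es-hq and es-head don't surface:

  # one line per shard recovery: index, shard, type, stage, and the
  # source and target nodes
  curl -XGET 'http://localhost:9200/_cat/recovery?v'
  # quick counts of initializing/relocating/unassigned shards
  curl -XGET 'http://localhost:9200/_cluster/health?pretty'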
 
5. Is there a ready tool to display (or return) specifically the ES
overhead data I suspect is being stored on the nodes? So far I've only
found overall data usage or free space. If nothing is available, I
suspect a workaround could be to query for the shard data sizes and then
subtract them from the overall storage used. If such a tool exists,
perhaps even breaking down how the space is being used, then maybe I can
start to understand exactly what is running differently in this cluster.
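
There is no single "overhead" number that I know of, but the subtraction
workaround described above can be done with two stock APIs: the indices
stats report the store size ES itself accounts for, while the node stats
report what the filesystem says is used. The gap between them is the
candidate overhead (a sketch of the idea, not an exact accounting --
translog, in-flight merges and Marvel's own indices all live in that
gap):

  # store size per index (and totals), as accounted for by ES
  curl -XGET 'http://localhost:9200/_stats?pretty'
  # filesystem used/free as each node sees it
  curl -XGET 'http://localhost:9200/_nodes/stats/fs?pretty'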
 

I speculate that something may not have been set up properly in this
cluster from the beginning, but I'm uncertain how to analyze exactly
what the problem is. I've posted the elasticsearch-hq screenshot at the
top of this post for reference, but if anyone can suggest a command to
extract further useful information, I'm open to it.
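
For further poking around, the cat API also lists its own endpoints,
which makes it easy to discover what else can be pulled out of the
cluster:

  curl -XGET 'http://localhost:9200/_cat'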

Thankfully this cluster is a lab, so I'm treating this as a learning
experience, but if this occurred in a larger production cluster I
imagine it would be setting off alarm bells.

Thx
Tony 
