Elasticsearch-HQ screenshot of node analysis https://github.com/putztzu/Misc_images/blob/master/elasticsearch-hq_Why_ES4.png
Elasticsearch 1.0 RC1, 5-node cluster:

ES-Marvel-openSUSE - 4GB RAM / 20GB ES storage (also runs web, logstash, redis and other apps, so it was given more RAM)
Elasticsearch-1 - 1GB RAM / 20GB ES storage
Elasticsearch-2 - 1GB RAM / 20GB ES storage
Elasticsearch-3 - 1GB RAM / 20GB ES storage
Elasticsearch-4 - 1GB RAM / 20GB ES storage

Data description: Apache data, indexed by date. Each index should hold more or less the same amount of data, so the expectation is that on average the shards should be more or less the same size.

"Normal" behavior observed: The same data has been inserted into this cluster three times (with a purge between each reload). The first two times the data was distributed across all the nodes more or less evenly.

Anomaly observed: The third time the cluster was set up, an anomaly was observed from the outset: disk usage grew unusually fast on the node the data was being inserted into (ES-Marvel-openSUSE) and on one other node (ELASTICSEARCH-4). Moreover, after two cluster shutdowns and recoveries the problem seems to have been exacerbated: the unequal data distribution not only persisted, but it looks like each recovery left permanent additional data across all nodes.

Questions:

1. When the cluster is recovered and no additional raw data is inserted, the increase in data storage suggests that additional ES data is created. That may make sense, since shard re-allocation appears to take place regardless of whether it should have been disabled, and it has been posted that it is cheaper to simply copy shards than to do integrity checks and re-integrate. I am running Marvel, but according to es-head and es-hq the Marvel data is very small compared to the major increases I am seeing, and those shards are not being allocated to ELASTICSEARCH-4. Does the increase in used disk storage suggest that obsolete data is not being purged? (A sketch of how I understand allocation can be disabled around restarts is below the list, in case the setting itself is my problem.)

2. What determines "balance" regarding shard allocation? Using es-head, I can see that sometimes fewer shards are allocated to the node with rapidly shrinking disk space (ELASTICSEARCH-4), but after a while allocation appears to go back to normal. Note that RAM and CPU capacity is equal across all nodes.

3. Is there a recommended remedy for this kind of situation? Since this appears to be a "runaway" scenario that keeps feeding a node that shortly won't have any capacity left, I have been considering simply shutting down the problem node, purging its data, re-joining it, and then hoping the cluster will re-balance itself. Would that be a recommended procedure, after verifying that all shards on the problem node have replicas on other nodes? If the situation is runaway, I don't consider simply adding storage to be a viable solution. (An alternative I am considering, draining the node with an allocation-exclude setting instead of killing it, is sketched below the list.)

4. The host machine these virtual machines run on indicates massive disk activity, but I am uncertain what to attribute it to. According to es-hq, two indices seem to be in the process of initializing, but according to es-head all shards have been allocated and are "green". Since no new data is being inserted and all existing shards should be healthy, I don't know why there should be any index initialization activity. Update: after watching es-hq for a while, I noticed that shard initialization is followed by a re-allocation, which might be related, but there is no easy visibility into which shard this is, which node it is on, and whether it really is being re-allocated.

5. Is there a ready tool to display (or return) specifically the ES overhead data I suspect is being stored on the nodes? So far I have only found overall data usage or free space. If nothing exists, I suspect a workaround would be to query for the shard data size and subtract it from the overall storage used (a rough version of that workaround is sketched below as well). If such a tool exists, ideally with a breakdown of how the space is being used, then maybe I can start to understand exactly what is running differently in this cluster.
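For reference on item 1, this is roughly how I understand allocation can be switched off before a full-cluster restart and back on afterwards. It is only a sketch: it assumes the REST API on localhost:9200, the Python requests library, and that I have the right setting name for 1.0 RC1 (the newer cluster.routing.allocation.enable, with the older disable_allocation setting as the documented predecessor).

import json
import requests

ES = "http://localhost:9200"

def set_allocation(enabled):
    # 1.0 RC1 should accept cluster.routing.allocation.enable ("all"/"none");
    # older releases used cluster.routing.allocation.disable_allocation instead.
    value = "all" if enabled else "none"
    body = {"transient": {"cluster.routing.allocation.enable": value}}
    resp = requests.put(ES + "/_cluster/settings", data=json.dumps(body))
    resp.raise_for_status()
    print(resp.json())

set_allocation(False)   # before shutting the nodes down
# ... perform the full-cluster restart here ...
set_allocation(True)    # after all nodes have rejoined and the cluster is stable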
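On items 3 and 4, the alternative I am considering to an outright shutdown is to drain the problem node with an allocation-exclude setting and then watch the resulting relocations with the cat recovery API. Again only a sketch, with a placeholder IP standing in for ELASTICSEARCH-4 and the same localhost:9200 / requests assumptions:

import json
import requests

ES = "http://localhost:9200"
PROBLEM_NODE_IP = "192.168.1.104"  # placeholder -- substitute ELASTICSEARCH-4's real address

# Ask the allocator to move every shard off the excluded node.
body = {"transient": {"cluster.routing.allocation.exclude._ip": PROBLEM_NODE_IP}}
requests.put(ES + "/_cluster/settings", data=json.dumps(body)).raise_for_status()

# Show which shards are initializing/relocating and between which nodes,
# which should also help explain the disk activity mentioned in item 4.
print(requests.get(ES + "/_cat/recovery?v").text)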
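And for item 5 (and the balance question in item 2), the workaround I have in mind is to compare the shard store sizes Elasticsearch reports against the filesystem usage on each node; whatever is left over should be the overhead I am trying to pin down. A rough sketch, assuming the cat and node-stats endpoints behave on 1.0 RC1 the way I think they do:

import requests

ES = "http://localhost:9200"

# One line per node: shard count, disk used/available, etc.
print(requests.get(ES + "/_cat/allocation?v").text)

# Per-shard store sizes, so an oversized shard on one node stands out.
print(requests.get(ES + "/_cat/shards?v").text)

# Filesystem totals straight from each node, for comparison with the above.
stats = requests.get(ES + "/_nodes/stats/fs").json()
for node_id, node in stats["nodes"].items():
    total = node["fs"]["total"]
    used_bytes = total["total_in_bytes"] - total["available_in_bytes"]
    print(node["name"], "filesystem bytes in use:", used_bytes)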
I am speculating that something may not have been set up properly in this cluster from the beginning, but I am uncertain how to analyze exactly what the problem is. I have posted the elasticsearch-hq screenshot at the top of this post for reference, but if someone can suggest a command to extract further useful information, I am open to it. Thankfully this cluster is a lab, so I am treating this as a learning experience, but if it occurred in a larger production cluster I imagine it would be setting off alarm bells.

Thx
Tony
