Hey guys,

I'm not sure if you could do anything about the issue, but still, I wanted 
to make sure you are aware of it. So, here is the story:


I work for Swiftype.com. Due to the nature of our business, we implement 
user search engines as separate or aggregated indexes in ES and use 
per-customer aliases to route requests to the appropriate place. As a 
result, we end up with clusters that have ~1000 shards and could easily 
have 50-150k aliases. ES holds up really well under load, and the only 
issue we have, which has been the major pain point for us for the last 4 
months, is the way ES handles cluster state changes.

Every time we add/remove an alias or an index, ES generates gigabytes of 
garbage on the Java heap. Adding more heap does not really help, because 
then those piles of garbage start causing very long old gen GC pauses, and 
that is really painful for us: our clusters are constantly under load, and 
having them stop for seconds to collect garbage is unacceptable.

For months we've been building all kinds of bandages to mitigate the 
effects of this issue (like creating an external per-cluster locking system 
to avoid concurrent cluster state changes and other crazy stuff like that) 
and trying to figure out why it was happening, without much luck. 
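
To illustrate the locking idea, here is a minimal sketch, not our actual 
system: it only serializes changes inside a single process, whereas the 
real thing is external and shared between clients. The class name and the 
`change` callable are hypothetical and stand in for whatever issues the 
alias or index request.

```python
import threading
from collections import defaultdict


class ClusterChangeSerializer:
    """Allow at most one cluster state change per cluster at a time.

    Illustrative only: an in-process stand-in for an external,
    per-cluster lock shared by all clients.
    """

    def __init__(self):
        self._locks = defaultdict(threading.Lock)
        self._guard = threading.Lock()  # protects the lock table itself

    def run(self, cluster_name, change):
        # Look up (or create) the lock that belongs to this cluster.
        with self._guard:
            lock = self._locks[cluster_name]
        # Hold the cluster's lock while the change runs, so a second
        # change for the same cluster waits instead of overlapping.
        with lock:
            return change()
```

Changes for different clusters do not block each other; only changes 
aimed at the same cluster are queued one after another.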

Yesterday we added a master-only (no data) node to one of our small 
clusters and gave it 4GB of heap (with new ratio = 3). Even during the 
process of joining the cluster it went into GC for 5-10 sec multiple times, 
every time failing to join because of timeouts. After many retries it would 
finally join, and old gen would hold steady at ~1.2GB (out of the 3GB we 
gave it) for hours and hours. But then, when some users started 
creating/changing their indexes too frequently (once every few seconds), 
the server would just go crazy and end up in a Full GC thrashing mode.

I took a Java heap dump from this server today, and since there is no 
Lucene crap in it (as there usually is on larger instances with data), it 
was REALLY obvious where the problem is: in my dump I see that our cluster 
MetaData object is ~681MB in size and we have 6 (!) copies of it live in 
memory: 
  * one copy in the current cluster metadata
  * one copy in a running InternalClusterService$UpdateTask
  * and 4 copies in InternalClusterService$UpdateTask instances in the 
    blocking queue for the update service.
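
A back-of-the-envelope check of those numbers, as a rough sketch. It 
assumes the six copies share no internal structures, which may overstate 
the total if the per-object sizes in the dump overlap. The constant and 
function names are only for illustration.

```python
HEAP_MB = 4 * 1024   # 4GB heap given to the master-only node
NEW_RATIO = 3        # new ratio = 3, i.e. old:young = 3:1
METADATA_MB = 681    # approximate size of one MetaData copy in the dump
COPIES = 6           # 1 current + 1 running task + 4 queued tasks


def old_gen_mb(heap_mb, new_ratio):
    # With old:young = N:1, old gen takes N/(N+1) of the heap.
    return heap_mb * new_ratio / (new_ratio + 1)


old_gen = old_gen_mb(HEAP_MB, NEW_RATIO)   # 3072.0 MB, the 3GB of old gen
all_copies = METADATA_MB * COPIES          # 4086 MB across the six copies

print(f"old gen: {old_gen:.0f} MB, metadata copies: {all_copies} MB")
print(f"copies exceed old gen: {all_copies > old_gen}")
```

Under that assumption the six copies alone would need more than the whole 
old gen, which would be consistent with the node falling into back-to-back 
Full GCs once several update tasks pile up in the queue.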

I understand that we could add many more gigabytes of RAM and that would 
help us handle such "surges" of cluster state updates, but I'm afraid 
that would just cost us a lot in GC pauses and would not really solve the 
issues we're having on data nodes anyway.


So, I was wondering if you have any suggestions on how we can mitigate this 
issue in the short term (aside from not creating so many aliases, which is 
not really possible for us at the moment) and if you have any ideas on how 
it could be solved in the server code in the long run.

I'd really appreciate any help!


P.S. We're on 0.90.5. If there are any fixes around cluster state handling 
in newer versions, please let me know. We're going to be upgrading our 
clusters very soon, and if there is a chance a new release would help with 
the GC issues, I'd bump up the priority of the upgrade in our task list.

--
Oleksiy Kovyrin
Head of Technical Operations
Swiftype.com
