On Wed, May 21, 2014 at 9:33 PM, Dustin Boswell <[email protected]> wrote:
> Hi everyone,
> I have a 10 million document index, and multiple high-memory (e.g. 250GB
> ram, 32 cores) machines available. I'd like to do everything possible to
> keep search latency as low as possible (< 50ms ideally), especially in a
> high-throughput environment. I know it depends a lot on the query, but
> to start with I'm asking about general index/cluster settings.
>
> Here's a list of things I'm doing so far:
> - ES_HEAP_SIZE=100g
> - machine has no swap
> - 20 shards
>
> Are there any other settings or search parameters I should be aware of?
>
>
If you care about latency, it might make sense to configure a smaller heap
size. The issue with large heaps is that they take longer to collect,
especially for collections of the old generation (I'm talking about minutes
here). I would recommend a heap size of at most 30GB (which will also
allow you to benefit from compressed pointers); one way to go this route
would be to start several nodes per physical machine.
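For illustration, here is a minimal sketch of what running two 30GB-heap nodes on one 250GB machine could look like (the node names and ports below are assumptions, not something from this thread; adjust to your install):

```shell
# Hypothetical: two Elasticsearch 1.x nodes on the same physical machine,
# each with a 30GB heap so both stay under the compressed-pointers limit.
ES_HEAP_SIZE=30g bin/elasticsearch -Des.node.name=node-1 \
  -Des.http.port=9200 -Des.transport.tcp.port=9300
ES_HEAP_SIZE=30g bin/elasticsearch -Des.node.name=node-2 \
  -Des.http.port=9201 -Des.transport.tcp.port=9301
```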
> Also, I'm wondering how many shards is recommended in my case. Having more
> shards helps reduce latency by parallelizing the work, but at some point
> the overhead of fanning out the requests and collecting the partial results
> will take over and latency would get worse. Is there a rule of thumb for a
> sweet spot that others have found?
>
If you don't have a lot of traffic, you could think about configuring
num_shards = total_num_cpus / num_concurrent_queries. But as you said,
there is also some overhead to a large number of shards, so this deserves
testing.
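As a back-of-the-envelope check of that heuristic (the machine count and concurrency below are assumptions for illustration, not numbers from this thread):

```python
# Sizing heuristic: give each concurrent query one CPU per shard,
# so that a single query can fan out across all of its shards in parallel.
total_num_cpus = 32 * 3          # assumed: 3 machines with 32 cores each
num_concurrent_queries = 4       # assumed peak query concurrency

num_shards = total_num_cpus // num_concurrent_queries
print(num_shards)  # 24
```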
> The volume of updates to the index is relatively small (500K/day), but
> bursty. From initial testing, it seems that issuing updates can
> increase search latency on the same machine. Is there a good
> way to "isolate" search and updates, either by some setting, or splitting
> up the cluster somehow to have dedicated update nodes and dedicated search
> nodes? (Not sure how you'd deploy a setup like this, or control where the
> search/update calls went.)
>
> The query I'm optimizing for will have a text search component and a
> geo-restrict component, maybe something like this:
> {
>   "query": {
>     // query may get more complex in the future
>     "match": { "_all": "my search terms" }
>   },
>   "filter": {
>     "geo_distance": {
>       "distance": "100km",
>       "location": {
>         "lat": 34.04,
>         "lon": -118.49
>       }
>     }
>   }
> }
>
> For the geo filter, I've tried the optimize_bbox option, and the default
> of "memory" seemed to work the best, surprisingly. I haven't tried using
> geohash yet, and I can't tell from the docs how one might use it, but maybe
> that is inherently faster since it uses indexes?
>
The bbox optimization is useful if your geo query matches a small portion
of your index. Maybe the issue here is that with such a large radius, you
match most of your documents?
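For reference, optimize_bbox is set directly on the geo_distance filter; its valid values are "memory", "indexed", and "none":

```json
{
  "filter": {
    "geo_distance": {
      "distance": "100km",
      "optimize_bbox": "memory",
      "location": { "lat": 34.04, "lon": -118.49 }
    }
  }
}
```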
> Unfortunately, there are a lot of unique locations in my query stream, so
> I don't know if caching this filter will work. (Each cached filter
> consumes about 1 bit of memory per document, is that right? So about 1.25MB in my
> case. Storing the most frequent 10,000 of these would take up about 12.5GB
> of ram. So maybe that's doable...)
>
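Your arithmetic checks out, by the way:

```python
num_docs = 10_000_000              # index size from your question
bytes_per_filter = num_docs / 8    # ~1 bit per document per cached filter

mb_per_filter = bytes_per_filter / 1e6
gb_for_10k = 10_000 * bytes_per_filter / 1e9
print(mb_per_filter, gb_for_10k)   # 1.25 12.5
```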
One thing to beware of is that if you cache a geo filter, Elasticsearch
will need to evaluate it against all documents in your index before
caching it. On the other hand, by default (if the filter is not cached),
the filter is only evaluated on documents that match the query, and this
holds for every query, not just the first one. So unless you have good
reason to think that your geo filter will be reused, I would recommend
against caching it.
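Caching is controlled per filter with the _cache flag, shown here on the geo_distance filter (false is already the default for geo filters, so you only need this if you want to opt in):

```json
{
  "geo_distance": {
    "_cache": false,
    "distance": "100km",
    "location": { "lat": 34.04, "lon": -118.49 }
  }
}
```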
--
Adrien Grand