Micheal: I'd would expect that setting the `size` option on the terms agg to a smaller value would have a positive impact on the total query time. Feels like I'm missing something, can you run hot threads api ( http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-nodes-hot-threads.html#cluster-nodes-hot-threads) while your run the search request that you've shared before? This basically gives a cluster wide stack dump and can perhaps give me an insight why your search request is slow.
Setting the `size` option of terms agg to 0 will return all buckets of that can be found on the fingerprint field (which can be millions of buckets), so I can see how this can bring down your cluster, because that simply doesn't fit in the Java heap space. Dustin: The `top_hits` aggregation is always nested under a bucket aggregator (for example the `terms` bucket aggregator). For each bucket the terms aggregator create the top_hits aggregator will create a priority queue, where this top_hits aggregator is going to maintain the top N docs that fall under the bucket it is in. So the time spent by the top_hits aggregator, like any other nested aggregator depends on the number of buckets being maintained during the execution of the search request. With the top_hits this is more noticeable compared to for example a metric agg (min, max, avg etc.), because of what the top_hits aggregator does. On 7 January 2015 at 20:29, Dustin Boswell <[email protected]> wrote: > I'm curious what the underlying algorithm is for TopHits. > > My mental model for ordinary aggregations is that there's basically a hash > table of (field_value -> count) maintained (for each field being > aggregated), and that hash table count is incremented once per document, > and then the top K elements of that hash table are returned to the user. > So there's O(1) work for each document scored, and then a final O(N*logN) > sort on that hash table to get the top K, where N is the number of unique > field_values. It makes sense to me why this implementation would be very > fast. > > My mental model for a top_hits aggregation is that there's a hash table of > (field_value -> array(pair(doc_id, score))). And for each document being > scored, that (doc_id, score) is appended to the corresponding array. Again, > there's only O(1) work for each document. At the end, you have to sort > each array, and then sort the hash table, and take the top K1 arrays, and > the top K2 elements of each array, and then for each doc_id, pull out the > relevant fields to return to the user. So definitely more work (and a lot > more memory), but I'm not sure if this would result in the 30x increase in > runtime we're seeing. (And actually, for the special case where > top_hits->size == 1, you only need the top (doc_id, score) seen, not a > whole array, so that would be a lot faster and less memory. But I > understand it needs to be able to handle more general cases.) > > Is this at all close to how it works? > > On Tuesday, January 6, 2015 11:20:08 PM UTC-8, Martijn v Groningen wrote: >> >> Hi Michael, >> >> In general the more buckets being returned by the parent aggregator the >> top_hits is nested in, the more work the top_hits agg needs to do, but I >> didn't come across performance issues with `size` on terms agg being set to >> 50 and the time it takes to execute increasing 30 times when top_hits is >> used. To exclude this on your side, can you play around with the `size` >> option on terms agg? >> >> Also perhaps the _source of your documents are relatively large. How does >> the top_hits agg perform without the `_source` option on the top_hits agg? >> >> Martijn >> >> On 6 January 2015 at 22:29, Michael Irani <[email protected]> wrote: >> >>> Sure. I simplified the query to keep things focused. >>> >>> This query takes about 3 seconds to run: >>> >>> { >>> >>> "size": 0, >>> >>> "aggs": { >>> "top-fingerprints": { >>> "terms": { >>> "field": "fingerprint", >>> "size": 50 >>> }, >>> "aggs": { >>> "top_tag_hits": { >>> "top_hits": { >>> "size": 1, >>> "_source": { >>> "include": [ >>> "title" >>> ] >>> } >>> } >>> } >>> } >>> } >>> } >>> >>> } >>> >>> >>> This one takes about 80 milliseconds: >>> >>> { >>> >>> "size": 0, >>> >>> "aggs": { >>> "fingerprints": { >>> "terms": { >>> "field": "fingerprint", >>> "size": 100 >>> } >>> } >>> } >>> >>> } >>> >>> >>> The result's a bit too big to paste here. Anything specific about it you >>> want me to expose? >>> >>> >>> Michael. >>> >>> >>> On Tuesday, January 6, 2015 12:14:55 PM UTC-8, Itamar Syn-Hershko wrote: >>>> >>>> Can you share the query and example results please? >>>> >>>> -- >>>> >>>> Itamar Syn-Hershko >>>> http://code972.com | @synhershko <https://twitter.com/synhershko> >>>> Freelance Developer & Consultant >>>> Author of RavenDB in Action <http://manning.com/synhershko/> >>>> >>>> On Tue, Jan 6, 2015 at 10:11 PM, Michael Irani <[email protected]> >>>> wrote: >>>> >>>>> Hello, >>>>> I'm working on a corpus of size approximately 10 million documents. >>>>> The issue I'm running into right now is that the top scoring documents >>>>> that >>>>> come back from my query are essentially all the same result. I'm trying to >>>>> find a way to get back unique results. >>>>> >>>>> I've looked into modeling the data differently with nested objects or >>>>> parent-child relationships, but neither layout seems to fit the bill. The >>>>> nested model won't work because some of the documents have too many >>>>> closely >>>>> related objects. On the flip side there are also too many unique documents >>>>> for the parent-child relationship to fit. >>>>> >>>>> I then tried the "top hits aggregation" and it's exactly what I'm >>>>> looking for, except the running time of the query is approximately 30x >>>>> slower than the query without the aggregation. Are there known performance >>>>> issues with "top hits"? Any ideas on what I should use to make these >>>>> queries? Here's the aggregation piece: >>>>> "aggs": { >>>>> >>>>> "top-fingerprints": { >>>>> "terms": { >>>>> "field": "fingerprint", >>>>> "size": 50 >>>>> }, >>>>> "aggs": { >>>>> "top_tag_hits": { >>>>> "top_hits": { >>>>> "size": 1, >>>>> "_source": { >>>>> "include": [ >>>>> "title" >>>>> ] >>>>> } >>>>> } >>>>> } >>>>> } >>>>> } >>>>> } >>>>> >>>>> >>>>> Thanks, >>>>> Michael >>>>> >>>>> -- >>>>> You received this message because you are subscribed to the Google >>>>> Groups "elasticsearch" group. >>>>> To unsubscribe from this group and stop receiving emails from it, send >>>>> an email to [email protected]. >>>>> To view this discussion on the web visit https://groups.google.com/d/ >>>>> msgid/elasticsearch/29fce15c-79b7-4756-b033-93e490204095%40goo >>>>> glegroups.com >>>>> <https://groups.google.com/d/msgid/elasticsearch/29fce15c-79b7-4756-b033-93e490204095%40googlegroups.com?utm_medium=email&utm_source=footer> >>>>> . >>>>> For more options, visit https://groups.google.com/d/optout. >>>>> >>>> >>>> -- >>> You received this message because you are subscribed to the Google >>> Groups "elasticsearch" group. >>> To unsubscribe from this group and stop receiving emails from it, send >>> an email to [email protected]. >>> To view this discussion on the web visit https://groups.google.com/d/ >>> msgid/elasticsearch/14e4a31c-3168-409a-8b2b-cb1e432ef433% >>> 40googlegroups.com >>> <https://groups.google.com/d/msgid/elasticsearch/14e4a31c-3168-409a-8b2b-cb1e432ef433%40googlegroups.com?utm_medium=email&utm_source=footer> >>> . >>> >>> For more options, visit https://groups.google.com/d/optout. >>> >> >> >> >> -- >> Met vriendelijke groet, >> >> Martijn van Groningen >> > -- Met vriendelijke groet, Martijn van Groningen -- You received this message because you are subscribed to the Google Groups "elasticsearch" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CA%2BA76TyG9hR4diPzgsJKfdiJ1jD8e5dhQ5JRuunBMwqR28VdYw%40mail.gmail.com. For more options, visit https://groups.google.com/d/optout.
