Michael & Dustin, what should reduce the query time a lot is setting
`collect_mode` to `breadth_first` on the `top-fingerprints` agg. Like this:
GET /_search?search_type=count
{
  "aggs": {
    "top-fingerprints": {
      "terms": {
        "field": "fingerprint",
        "size": 50,
        "collect_mode": "breadth_first"
      },
      "aggs": {
        "top_tag_hits": {
          "top_hits": {
            "size": 1,
            "_source": {
              "include": [ "title" ]
            },
            "sort": {
              "_doc": {}
            }
          }
        }
      }
    }
  }
}
By default the top_hits agg creates and maintains a priority hit queue for
every bucket created by the terms agg, including the ones outside the top 50,
which can potentially be millions of buckets. By telling the terms agg to run
in breadth_first mode, the top_hits agg only creates and maintains a priority
hit queue for the top 50 buckets instead of for all of them. This should
improve performance considerably. There is one catch: the top_hits agg can no
longer sort by score (which is the default), because the breadth_first collect
mode doesn't buffer scores. That is why a sort is defined on the top_hits agg.
In this example I sort by Lucene doc id, which is somewhat arbitrary because
you have no control over these sort values, but you can sort by any field in
your mapping.
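For example, assuming your documents have a date field in the mapping (I'm
using `created_at` here purely as a placeholder), the sort inside the top_hits
agg could look like this instead of `_doc`:
"top_hits": {
  "size": 1,
  "_source": {
    "include": [ "title" ]
  },
  "sort": [
    { "created_at": { "order": "desc" } }
  ]
}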
More information about the collect mode:
1) http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_collect_mode
2) http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_preventing_combinatorial_explosions.html#_depth_first_versus_breadth_first
On 8 January 2015 at 10:56, Martijn v Groningen <
[email protected]> wrote:
> Michael: I would expect that setting the `size` option on the terms agg
> to a smaller value would have a positive impact on the total query time.
> It feels like I'm missing something; can you run the hot threads API (
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-nodes-hot-threads.html#cluster-nodes-hot-threads)
> while you run the search request that you've shared before? This basically
> gives a cluster-wide stack dump and can perhaps give me insight into why
> your search request is slow.
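>
> For example, something along these lines (run against any node while the
> slow search request is executing) dumps the hottest threads on every node:
>
> GET /_nodes/hot_threads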
>
> Setting the `size` option of the terms agg to 0 will return all buckets
> that can be found on the fingerprint field (which can be millions of
> buckets), so I can see how this can bring down your cluster, because that
> simply doesn't fit in the Java heap space.
>
> Dustin: The `top_hits` aggregation is always nested under a bucket
> aggregator (for example the `terms` bucket aggregator). For each bucket the
> terms aggregator creates, the top_hits aggregator creates a priority queue
> in which it maintains the top N docs that fall under that bucket. So the
> time spent by the top_hits aggregator, like that of any other nested
> aggregator, depends on the number of buckets being maintained during the
> execution of the search request. With top_hits this is more noticeable than
> with, for example, a metric agg (min, max, avg, etc.), because maintaining
> a priority queue of documents per bucket is more work than keeping a single
> running value per bucket.
>
> On 7 January 2015 at 20:29, Dustin Boswell <[email protected]> wrote:
>
>> I'm curious what the underlying algorithm is for TopHits.
>>
>> My mental model for ordinary aggregations is that there's basically a
>> hash table of (field_value -> count) maintained (for each field being
>> aggregated), and that hash table count is incremented once per document,
>> and then the top K elements of that hash table are returned to the user.
>> So there's O(1) work for each document scored, and then a final O(N*logN)
>> sort on that hash table to get the top K, where N is the number of unique
>> field_values. It makes sense to me why this implementation would be very
>> fast.
>>
>> My mental model for a top_hits aggregation is that there's a hash table
>> of (field_value -> array(pair(doc_id, score))). And for each document
>> being scored, that (doc_id, score) is appended to the corresponding array.
>> Again, there's only O(1) work for each document. At the end, you have to
>> sort each array, and then sort the hash table, and take the top K1 arrays,
>> and the top K2 elements of each array, and then for each doc_id, pull out
>> the relevant fields to return to the user. So definitely more work (and a
>> lot more memory), but I'm not sure if this would result in the 30x increase
>> in runtime we're seeing. (And actually, for the special case where
>> top_hits->size == 1, you only need the top (doc_id, score) seen, not a
>> whole array, so that would be a lot faster and use much less memory. But I
>> understand it needs to be able to handle more general cases.)
>>
>> Is this at all close to how it works?
>>
>> On Tuesday, January 6, 2015 11:20:08 PM UTC-8, Martijn v Groningen wrote:
>>>
>>> Hi Michael,
>>>
>>> In general, the more buckets returned by the parent aggregator the
>>> top_hits is nested in, the more work the top_hits agg needs to do, but I
>>> haven't come across cases where, with `size` on the terms agg set to 50,
>>> the execution time increases 30 times when top_hits is used. To rule this
>>> out on your side, can you play around with the `size` option on the
>>> terms agg?
>>>
>>> Also, perhaps the _source of your documents is relatively large. How
>>> does the top_hits agg perform without the `_source` option on the
>>> top_hits agg?
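>>>
>>> For example (the same agg, just with the `_source` block removed — a
>>> quick way to see how much of the time is spent loading _source):
>>>
>>> "top_tag_hits": {
>>>   "top_hits": {
>>>     "size": 1
>>>   }
>>> }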
>>>
>>> Martijn
>>>
>>> On 6 January 2015 at 22:29, Michael Irani <[email protected]> wrote:
>>>
>>>> Sure. I simplified the query to keep things focused.
>>>>
>>>> This query takes about 3 seconds to run:
>>>>
>>>> {
>>>>   "size": 0,
>>>>   "aggs": {
>>>>     "top-fingerprints": {
>>>>       "terms": {
>>>>         "field": "fingerprint",
>>>>         "size": 50
>>>>       },
>>>>       "aggs": {
>>>>         "top_tag_hits": {
>>>>           "top_hits": {
>>>>             "size": 1,
>>>>             "_source": {
>>>>               "include": [ "title" ]
>>>>             }
>>>>           }
>>>>         }
>>>>       }
>>>>     }
>>>>   }
>>>> }
>>>>
>>>>
>>>> This one takes about 80 milliseconds:
>>>>
>>>> {
>>>>   "size": 0,
>>>>   "aggs": {
>>>>     "fingerprints": {
>>>>       "terms": {
>>>>         "field": "fingerprint",
>>>>         "size": 100
>>>>       }
>>>>     }
>>>>   }
>>>> }
>>>>
>>>>
>>>> The result's a bit too big to paste here. Anything specific about it you
>>>> want me to expose?
>>>>
>>>>
>>>> Michael.
>>>>
>>>>
>>>> On Tuesday, January 6, 2015 12:14:55 PM UTC-8, Itamar Syn-Hershko wrote:
>>>>>
>>>>> Can you share the query and example results please?
>>>>>
>>>>> --
>>>>>
>>>>> Itamar Syn-Hershko
>>>>> http://code972.com | @synhershko <https://twitter.com/synhershko>
>>>>> Freelance Developer & Consultant
>>>>> Author of RavenDB in Action <http://manning.com/synhershko/>
>>>>>
>>>>> On Tue, Jan 6, 2015 at 10:11 PM, Michael Irani <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>> I'm working on a corpus of size approximately 10 million documents.
>>>>>> The issue I'm running into right now is that the top scoring documents
>>>>>> that
>>>>>> come back from my query are essentially all the same result. I'm trying
>>>>>> to
>>>>>> find a way to get back unique results.
>>>>>>
>>>>>> I've looked into modeling the data differently with nested objects or
>>>>>> parent-child relationships, but neither layout seems to fit the bill. The
>>>>>> nested model won't work because some of the documents have too many
>>>>>> closely
>>>>>> related objects. On the flip side there are also too many unique
>>>>>> documents
>>>>>> for the parent-child relationship to fit.
>>>>>>
>>>>>> I then tried the "top hits aggregation" and it's exactly what I'm
>>>>>> looking for, except the running time of the query is approximately 30x
>>>>>> slower than the query without the aggregation. Are there known
>>>>>> performance
>>>>>> issues with "top hits"? Any ideas on what I should use to make these
>>>>>> queries? Here's the aggregation piece:
>>>>>> "aggs": {
>>>>>>
>>>>>> "top-fingerprints": {
>>>>>> "terms": {
>>>>>> "field": "fingerprint",
>>>>>> "size": 50
>>>>>> },
>>>>>> "aggs": {
>>>>>> "top_tag_hits": {
>>>>>> "top_hits": {
>>>>>> "size": 1,
>>>>>> "_source": {
>>>>>> "include": [
>>>>>> "title"
>>>>>> ]
>>>>>> }
>>>>>> }
>>>>>> }
>>>>>> }
>>>>>> }
>>>>>> }
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Michael
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Kind regards,
>>>
>>> Martijn van Groningen
>>>
>>
>
>
> --
> Kind regards,
>
> Martijn van Groningen
>
--
Kind regards,
Martijn van Groningen