Re: Bucket query results | top hits performance

Dustin Boswell Wed, 07 Jan 2015 11:29:59 -0800

I'm curious what the underlying algorithm is for TopHits.

My mental model for ordinary aggregations is that there's basically a hash 
table of (field_value -> count) maintained (for each field being 
aggregated), and that hash table count is incremented once per document, 
and then the top K elements of that hash table are returned to the user. 
 So there's O(1) work for each document scored, and then a final O(N*logN) 
sort on that hash table to get the top K, where N is the number of unique 
field_values.  It makes sense to me why this implementation would be very 
fast.


My mental model for a top_hits aggregation is that there's a hash table of 
(field_value -> array(pair(doc_id, score))).  And for each document being 
scored, that (doc_id, score) is appended to the corresponding array. Again, 
there's only O(1) work for each document.  At the end, you have to sort 
each array, and then sort the hash table, and take the top K1 arrays, and 
the top K2 elements of each array, and then for each doc_id, pull out the 
relevant fields to return to the user.  So definitely more work (and a lot 
more memory), but I'm not sure if this would result in the 30x increase in 
runtime we're seeing.  (And actually, for the special case where 
top_hits->size == 1, you only need the top (doc_id, score) seen, not a 
whole array, so that would be a lot faster and less memory. But I 
understand it needs to be able to handle more general cases.)

Is this at all close to how it works?

On Tuesday, January 6, 2015 11:20:08 PM UTC-8, Martijn v Groningen wrote:
>
> Hi Michael,
>
> In general the more buckets being returned by the parent aggregator the 
> top_hits is nested in, the more work the top_hits agg needs to do, but I 
> didn't come across performance issues with `size` on terms agg being set to 
> 50 and the time it takes to execute increasing 30 times when top_hits is 
> used. To exclude this on your side, can you play around with the `size` 
> option on terms agg?
>
> Also perhaps the _source of your documents are relatively large. How does 
> the top_hits agg perform without the `_source` option on the top_hits agg? 
>
> Martijn
>
> On 6 January 2015 at 22:29, Michael Irani <[email protected] 
> <javascript:>> wrote:
>
>> Sure. I simplified the query to keep things focused.
>>
>> This query takes about 3 seconds to run:
>>
>> {
>>
>>     "size": 0,
>>
>>     "aggs": {
>>         "top-fingerprints": {
>>             "terms": {
>>                 "field": "fingerprint",
>>                 "size": 50
>>             },
>>             "aggs": {
>>                 "top_tag_hits": {
>>                     "top_hits": {
>>                         "size": 1,
>>                         "_source": {
>>                            "include": [
>>                               "title"
>>                            ]
>>                         }
>>                     }
>>                 }
>>             }
>>         }
>>     }
>>
>> }
>>
>>
>> This one takes about 80 milliseconds:
>>
>> {
>>
>>     "size": 0,
>>
>>     "aggs": {
>>         "fingerprints": {
>>             "terms": {
>>                 "field": "fingerprint",
>>                 "size": 100
>>             }
>>         }
>>     }
>>
>> }
>>
>>
>> The result's a bit too big to paste here. Anything specific about it you 
>> want me to expose?
>>
>>
>> Michael.
>>
>>
>> On Tuesday, January 6, 2015 12:14:55 PM UTC-8, Itamar Syn-Hershko wrote:
>>>
>>> Can you share the query and example results please?
>>>
>>> --
>>>
>>> Itamar Syn-Hershko
>>> http://code972.com | @synhershko <https://twitter.com/synhershko>
>>> Freelance Developer & Consultant
>>> Author of RavenDB in Action <http://manning.com/synhershko/>
>>>
>>> On Tue, Jan 6, 2015 at 10:11 PM, Michael Irani <[email protected]> 
>>> wrote:
>>>
>>>> Hello,
>>>> I'm working on a corpus of size approximately 10 million documents. The 
>>>> issue I'm running into right now is that the top scoring documents that 
>>>> come back from my query are essentially all the same result. I'm trying to 
>>>> find a way to get back unique results.
>>>>
>>>> I've looked into modeling the data differently with nested objects or 
>>>> parent-child relationships, but neither layout seems to fit the bill. The 
>>>> nested model won't work because some of the documents have too many 
>>>> closely 
>>>> related objects. On the flip side there are also too many unique documents 
>>>> for the parent-child relationship to fit.
>>>>
>>>> I then tried the "top hits aggregation" and it's exactly what I'm 
>>>> looking for, except the running time of the query is approximately 30x 
>>>> slower than the query without the aggregation. Are there known performance 
>>>> issues with "top hits"? Any ideas on what I should use to make these 
>>>> queries? Here's the aggregation piece:
>>>> "aggs": {
>>>>
>>>>     "top-fingerprints": {
>>>>         "terms": {
>>>>             "field": "fingerprint",
>>>>             "size": 50
>>>>         },
>>>>         "aggs": {
>>>>             "top_tag_hits": {
>>>>                 "top_hits": {
>>>>                     "size": 1,
>>>>                     "_source": {
>>>>                        "include": [
>>>>                           "title"
>>>>                        ]
>>>>                     }
>>>>                 }
>>>>             }
>>>>         }
>>>>     }
>>>> }
>>>>
>>>>
>>>> Thanks,
>>>> Michael
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "elasticsearch" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> To view this discussion on the web visit https://groups.google.com/d/
>>>> msgid/elasticsearch/29fce15c-79b7-4756-b033-93e490204095%
>>>> 40googlegroups.com 
>>>> <https://groups.google.com/d/msgid/elasticsearch/29fce15c-79b7-4756-b033-93e490204095%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>>  -- 
>> You received this message because you are subscribed to the Google Groups 
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected] <javascript:>.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/elasticsearch/14e4a31c-3168-409a-8b2b-cb1e432ef433%40googlegroups.com
>>  
>> <https://groups.google.com/d/msgid/elasticsearch/14e4a31c-3168-409a-8b2b-cb1e432ef433%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
>
> -- 
> Met vriendelijke groet,
>
> Martijn van Groningen
>  

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/52497ce2-cc18-4d75-a36e-dbc884288672%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: Bucket query results | top hits performance

Reply via email to