Re: Bucket query results | top hits performance

Michael Irani Wed, 07 Jan 2015 11:01:52 -0800

Martijn,
Thanks for thinking about this. I tried setting the `size` on the terms agg to 
1, 5, 10, 25, and 50, and the timing didn't change much. Interestingly, I also set 
the size to 0, which took down our cluster. I also tried removing the 
`_source` option, and that had no noticeable effect on performance. 
The payload for each of our documents is about 5k.
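
In case it helps anyone reproduce this, here is a rough sketch of how the variations can be generated. The helper names (`build_query`, `build_variants`) are made up for illustration and are not part of any Elasticsearch client. The script only builds and prints the request bodies; it does not talk to a cluster.

```python
import json


def build_query(terms_size, source_fields=("title",)):
    """Build the search body from this thread for a given terms `size`.

    Pass source_fields=None to drop the `_source` option from top_hits.
    """
    top_hits = {"size": 1}
    if source_fields is not None:
        top_hits["_source"] = {"include": list(source_fields)}
    return {
        "size": 0,  # no regular hits, aggregations only
        "aggs": {
            "top-fingerprints": {
                "terms": {"field": "fingerprint", "size": terms_size},
                "aggs": {"top_tag_hits": {"top_hits": top_hits}},
            }
        },
    }


# Sizes tried above; 0 is deliberately left out because it took down the cluster.
SIZES = [1, 5, 10, 25, 50]


def build_variants():
    """Return (label, body) pairs: each size with and without `_source`."""
    variants = []
    for size in SIZES:
        variants.append((f"size={size}, with _source filter", build_query(size)))
        variants.append((f"size={size}, no _source option", build_query(size, None)))
    return variants


if __name__ == "__main__":
    for label, body in build_variants():
        print(label)
        print(json.dumps(body, indent=2))
```

Each printed body can be sent to the `_search` endpoint, and the `took` value in each response compared across variants.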


Michael.

On Tuesday, January 6, 2015 11:20:08 PM UTC-8, Martijn v Groningen wrote:
>
> Hi Michael,
>
> In general, the more buckets returned by the parent aggregator that 
> top_hits is nested in, the more work the top_hits agg needs to do. That said, I 
> haven't come across cases where setting `size` on the terms agg to 50 makes 
> execution take 30 times longer once top_hits is added. To rule this out on 
> your side, can you play around with the `size` option on the terms agg?
>
> Also, perhaps the _source of your documents is relatively large. How does 
> the top_hits agg perform without the `_source` option? 
>
> Martijn
>
> On 6 January 2015 at 22:29, Michael Irani <[email protected]> wrote:
>
>> Sure. I simplified the query to keep things focused.
>>
>> This query takes about 3 seconds to run:
>>
>> {
>>
>>     "size": 0,
>>
>>     "aggs": {
>>         "top-fingerprints": {
>>             "terms": {
>>                 "field": "fingerprint",
>>                 "size": 50
>>             },
>>             "aggs": {
>>                 "top_tag_hits": {
>>                     "top_hits": {
>>                         "size": 1,
>>                         "_source": {
>>                            "include": [
>>                               "title"
>>                            ]
>>                         }
>>                     }
>>                 }
>>             }
>>         }
>>     }
>>
>> }
>>
>>
>> This one takes about 80 milliseconds:
>>
>> {
>>
>>     "size": 0,
>>
>>     "aggs": {
>>         "fingerprints": {
>>             "terms": {
>>                 "field": "fingerprint",
>>                 "size": 100
>>             }
>>         }
>>     }
>>
>> }
>>
>>
>> The result's a bit too big to paste here. Anything specific about it you 
>> want me to expose?
>>
>>
>> Michael.
>>
>>
>> On Tuesday, January 6, 2015 12:14:55 PM UTC-8, Itamar Syn-Hershko wrote:
>>>
>>> Can you share the query and example results please?
>>>
>>> --
>>>
>>> Itamar Syn-Hershko
>>> http://code972.com | @synhershko <https://twitter.com/synhershko>
>>> Freelance Developer & Consultant
>>> Author of RavenDB in Action <http://manning.com/synhershko/>
>>>
>>> On Tue, Jan 6, 2015 at 10:11 PM, Michael Irani <[email protected]> 
>>> wrote:
>>>
>>>> Hello,
>>>> I'm working on a corpus of approximately 10 million documents. The 
>>>> issue I'm running into right now is that the top-scoring documents that 
>>>> come back from my query are essentially all the same result. I'm trying to 
>>>> find a way to get back unique results.
>>>>
>>>> I've looked into modeling the data differently with nested objects or 
>>>> parent-child relationships, but neither layout seems to fit the bill. The 
>>>> nested model won't work because some of the documents have too many 
>>>> closely related objects. On the flip side, there are also too many unique 
>>>> documents for the parent-child relationship to fit.
>>>>
>>>> I then tried the "top hits aggregation" and it's exactly what I'm 
>>>> looking for, except the running time of the query is approximately 30x 
>>>> slower than the query without the aggregation. Are there known performance 
>>>> issues with "top hits"? Any ideas on what I should use to make these 
>>>> queries? Here's the aggregation piece:
>>>> "aggs": {
>>>>
>>>>     "top-fingerprints": {
>>>>         "terms": {
>>>>             "field": "fingerprint",
>>>>             "size": 50
>>>>         },
>>>>         "aggs": {
>>>>             "top_tag_hits": {
>>>>                 "top_hits": {
>>>>                     "size": 1,
>>>>                     "_source": {
>>>>                        "include": [
>>>>                           "title"
>>>>                        ]
>>>>                     }
>>>>                 }
>>>>             }
>>>>         }
>>>>     }
>>>> }
>>>>
>>>>
>>>> Thanks,
>>>> Michael
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "elasticsearch" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/elasticsearch/29fce15c-79b7-4756-b033-93e490204095%40googlegroups.com.
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>
>
>
>
> -- 
> Met vriendelijke groet,
>
> Martijn van Groningen
>  

