Martijn, Thanks for thinking about this. I tried changing the `size` on terms agg to 1, 5, 10, 25, 50 and timing didn't change much. Interestingly I also set the size to 0 which in turn took down our cluster. I tried removing the `_source` option and that didn't have any noticeable effect on performance. The payload for each of our documents is about 5k.
Michael. On Tuesday, January 6, 2015 11:20:08 PM UTC-8, Martijn v Groningen wrote: > > Hi Michael, > > In general the more buckets being returned by the parent aggregator the > top_hits is nested in, the more work the top_hits agg needs to do, but I > didn't come across performance issues with `size` on terms agg being set to > 50 and the time it takes to execute increasing 30 times when top_hits is > used. To exclude this on your side, can you play around with the `size` > option on terms agg? > > Also perhaps the _source of your documents are relatively large. How does > the top_hits agg perform without the `_source` option on the top_hits agg? > > Martijn > > On 6 January 2015 at 22:29, Michael Irani <[email protected] > <javascript:>> wrote: > >> Sure. I simplified the query to keep things focused. >> >> This query takes about 3 seconds to run: >> >> { >> >> "size": 0, >> >> "aggs": { >> "top-fingerprints": { >> "terms": { >> "field": "fingerprint", >> "size": 50 >> }, >> "aggs": { >> "top_tag_hits": { >> "top_hits": { >> "size": 1, >> "_source": { >> "include": [ >> "title" >> ] >> } >> } >> } >> } >> } >> } >> >> } >> >> >> This one takes about 80 milliseconds: >> >> { >> >> "size": 0, >> >> "aggs": { >> "fingerprints": { >> "terms": { >> "field": "fingerprint", >> "size": 100 >> } >> } >> } >> >> } >> >> >> The result's a bit too big to paste here. Anything specific about it you >> want me to expose? >> >> >> Michael. >> >> >> On Tuesday, January 6, 2015 12:14:55 PM UTC-8, Itamar Syn-Hershko wrote: >>> >>> Can you share the query and example results please? >>> >>> -- >>> >>> Itamar Syn-Hershko >>> http://code972.com | @synhershko <https://twitter.com/synhershko> >>> Freelance Developer & Consultant >>> Author of RavenDB in Action <http://manning.com/synhershko/> >>> >>> On Tue, Jan 6, 2015 at 10:11 PM, Michael Irani <[email protected]> >>> wrote: >>> >>>> Hello, >>>> I'm working on a corpus of size approximately 10 million documents. The >>>> issue I'm running into right now is that the top scoring documents that >>>> come back from my query are essentially all the same result. I'm trying to >>>> find a way to get back unique results. >>>> >>>> I've looked into modeling the data differently with nested objects or >>>> parent-child relationships, but neither layout seems to fit the bill. The >>>> nested model won't work because some of the documents have too many >>>> closely >>>> related objects. On the flip side there are also too many unique documents >>>> for the parent-child relationship to fit. >>>> >>>> I then tried the "top hits aggregation" and it's exactly what I'm >>>> looking for, except the running time of the query is approximately 30x >>>> slower than the query without the aggregation. Are there known performance >>>> issues with "top hits"? Any ideas on what I should use to make these >>>> queries? Here's the aggregation piece: >>>> "aggs": { >>>> >>>> "top-fingerprints": { >>>> "terms": { >>>> "field": "fingerprint", >>>> "size": 50 >>>> }, >>>> "aggs": { >>>> "top_tag_hits": { >>>> "top_hits": { >>>> "size": 1, >>>> "_source": { >>>> "include": [ >>>> "title" >>>> ] >>>> } >>>> } >>>> } >>>> } >>>> } >>>> } >>>> >>>> >>>> Thanks, >>>> Michael >>>> >>>> -- >>>> You received this message because you are subscribed to the Google >>>> Groups "elasticsearch" group. >>>> To unsubscribe from this group and stop receiving emails from it, send >>>> an email to [email protected]. >>>> To view this discussion on the web visit https://groups.google.com/d/ >>>> msgid/elasticsearch/29fce15c-79b7-4756-b033-93e490204095% >>>> 40googlegroups.com >>>> <https://groups.google.com/d/msgid/elasticsearch/29fce15c-79b7-4756-b033-93e490204095%40googlegroups.com?utm_medium=email&utm_source=footer> >>>> . >>>> For more options, visit https://groups.google.com/d/optout. >>>> >>> >>> -- >> You received this message because you are subscribed to the Google Groups >> "elasticsearch" group. >> To unsubscribe from this group and stop receiving emails from it, send an >> email to [email protected] <javascript:>. >> To view this discussion on the web visit >> https://groups.google.com/d/msgid/elasticsearch/14e4a31c-3168-409a-8b2b-cb1e432ef433%40googlegroups.com >> >> <https://groups.google.com/d/msgid/elasticsearch/14e4a31c-3168-409a-8b2b-cb1e432ef433%40googlegroups.com?utm_medium=email&utm_source=footer> >> . >> >> For more options, visit https://groups.google.com/d/optout. >> > > > > -- > Met vriendelijke groet, > > Martijn van Groningen > -- You received this message because you are subscribed to the Google Groups "elasticsearch" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/807b8f8f-a944-4301-b476-185c46ede468%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
