Re: Bucket query results | top hits performance

Martijn v Groningen Thu, 08 Jan 2015 01:57:07 -0800

Michael: I'd expect that setting the `size` option on the terms agg
to a smaller value would have a positive impact on the total query time.
It feels like I'm missing something. Can you run the hot threads API (
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-nodes-hot-threads.html#cluster-nodes-hot-threads)
while you run the search request that you shared before? This basically
gives a cluster-wide stack dump and may give me some insight into why your
search request is slow.
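
For reference, a minimal sketch of how the hot threads output could be
fetched from a script while the slow search is running. The base URL
(localhost:9200) and the helper names are assumptions for illustration;
`threads` and `interval` are optional parameters of the hot threads API.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def hot_threads_url(base="http://localhost:9200", threads=3, interval="500ms"):
    """Build the hot threads URL for all nodes (base URL is an assumption)."""
    query = urlencode({"threads": threads, "interval": interval})
    return f"{base}/_nodes/hot_threads?{query}"


def fetch_hot_threads(base="http://localhost:9200", **params):
    """Return the plain-text hot threads report; needs a reachable cluster."""
    with urlopen(hot_threads_url(base, **params)) as resp:
        return resp.read().decode("utf-8")
```

Calling `fetch_hot_threads()` a few times during the slow request and
sharing the text output should be enough.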


Setting the `size` option of the terms agg to 0 will return all buckets
that can be found on the fingerprint field (which can be millions of
buckets), so I can see how this can bring down your cluster, because that
simply doesn't fit in the Java heap space.

Dustin: The `top_hits` aggregation is always nested under a bucket
aggregator (for example the `terms` bucket aggregator). For each bucket the
terms aggregator creates, the top_hits aggregator creates a priority
queue, in which it maintains the top N docs that fall under that bucket.
So the time spent by the top_hits aggregator, like any other nested
aggregator, depends on the number of buckets being maintained during the
execution of the search request. With top_hits this is more noticeable
than with, for example, a metric agg (min, max, avg etc.), because of what
the top_hits aggregator does for every bucket.
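
To make the per-bucket priority queue idea concrete, here is a rough
sketch in Python. It is not the actual Elasticsearch implementation; the
function name, the (doc_id, bucket_key, score) tuples and the sample data
are made up for illustration. It only shows why the cost grows with the
number of buckets: every bucket gets its own bounded queue.

```python
import heapq
from collections import defaultdict


def top_hits_per_bucket(docs, top_n=1):
    """Keep the top_n highest-scoring docs per bucket key.

    docs is an iterable of (doc_id, bucket_key, score) tuples.
    """
    heaps = defaultdict(list)  # one bounded min-heap per bucket
    for doc_id, key, score in docs:
        heap = heaps[key]
        if len(heap) < top_n:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            # Replace the current lowest-scoring entry in this bucket.
            heapq.heapreplace(heap, (score, doc_id))
    # Best-scoring doc first within each bucket.
    return {key: sorted(heap, reverse=True) for key, heap in heaps.items()}


docs = [
    (1, "a", 0.5),
    (2, "a", 0.9),
    (3, "b", 0.2),
    (4, "a", 0.7),
    (5, "b", 0.4),
]
result = top_hits_per_bucket(docs, top_n=2)
```

With millions of distinct fingerprint values that is millions of small
queues held in memory at once, whereas a metric agg only keeps a single
number per bucket.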

On 7 January 2015 at 20:29, Dustin Boswell <[email protected]> wrote:

> I'm curious what the underlying algorithm is for TopHits.
>
> My mental model for ordinary aggregations is that there's basically a hash
> table of (field_value -> count) maintained (for each field being
> aggregated), and that hash table count is incremented once per document,
> and then the top K elements of that hash table are returned to the user.
> So there's O(1) work for each document scored, and then a final O(N*logN)
> sort on that hash table to get the top K, where N is the number of unique
> field_values.  It makes sense to me why this implementation would be very
> fast.
>
> My mental model for a top_hits aggregation is that there's a hash table of
> (field_value -> array(pair(doc_id, score))).  And for each document being
> scored, that (doc_id, score) is appended to the corresponding array. Again,
> there's only O(1) work for each document.  At the end, you have to sort
> each array, and then sort the hash table, and take the top K1 arrays, and
> the top K2 elements of each array, and then for each doc_id, pull out the
> relevant fields to return to the user.  So definitely more work (and a lot
> more memory), but I'm not sure if this would result in the 30x increase in
> runtime we're seeing.  (And actually, for the special case where
> top_hits->size == 1, you only need the top (doc_id, score) seen, not a
> whole array, so that would be a lot faster and less memory. But I
> understand it needs to be able to handle more general cases.)
>
> Is this at all close to how it works?
>
> On Tuesday, January 6, 2015 11:20:08 PM UTC-8, Martijn v Groningen wrote:
>>
>> Hi Michael,
>>
>> In general the more buckets being returned by the parent aggregator the
>> top_hits is nested in, the more work the top_hits agg needs to do, but I
>> didn't come across performance issues with `size` on terms agg being set to
>> 50 and the time it takes to execute increasing 30 times when top_hits is
>> used. To exclude this on your side, can you play around with the `size`
>> option on terms agg?
>>
>> Also, perhaps the _source of your documents is relatively large. How does
>> the top_hits agg perform without the `_source` option on the top_hits agg?
>>
>> Martijn
>>
>> On 6 January 2015 at 22:29, Michael Irani <[email protected]> wrote:
>>
>>> Sure. I simplified the query to keep things focused.
>>>
>>> This query takes about 3 seconds to run:
>>>
>>> {
>>>
>>>     "size": 0,
>>>
>>>     "aggs": {
>>>         "top-fingerprints": {
>>>             "terms": {
>>>                 "field": "fingerprint",
>>>                 "size": 50
>>>             },
>>>             "aggs": {
>>>                 "top_tag_hits": {
>>>                     "top_hits": {
>>>                         "size": 1,
>>>                         "_source": {
>>>                            "include": [
>>>                               "title"
>>>                            ]
>>>                         }
>>>                     }
>>>                 }
>>>             }
>>>         }
>>>     }
>>>
>>> }
>>>
>>>
>>> This one takes about 80 milliseconds:
>>>
>>> {
>>>
>>>     "size": 0,
>>>
>>>     "aggs": {
>>>         "fingerprints": {
>>>             "terms": {
>>>                 "field": "fingerprint",
>>>                 "size": 100
>>>             }
>>>         }
>>>     }
>>>
>>> }
>>>
>>>
>>> The result's a bit too big to paste here. Anything specific about it you 
>>> want me to expose?
>>>
>>>
>>> Michael.
>>>
>>>
>>> On Tuesday, January 6, 2015 12:14:55 PM UTC-8, Itamar Syn-Hershko wrote:
>>>>
>>>> Can you share the query and example results please?
>>>>
>>>> --
>>>>
>>>> Itamar Syn-Hershko
>>>> http://code972.com | @synhershko <https://twitter.com/synhershko>
>>>> Freelance Developer & Consultant
>>>> Author of RavenDB in Action <http://manning.com/synhershko/>
>>>>
>>>> On Tue, Jan 6, 2015 at 10:11 PM, Michael Irani <[email protected]>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>> I'm working on a corpus of size approximately 10 million documents.
>>>>> The issue I'm running into right now is that the top scoring documents
>>>>> that come back from my query are essentially all the same result. I'm
>>>>> trying to find a way to get back unique results.
>>>>>
>>>>> I've looked into modeling the data differently with nested objects or
>>>>> parent-child relationships, but neither layout seems to fit the bill. The
>>>>> nested model won't work because some of the documents have too many
>>>>> closely related objects. On the flip side there are also too many unique
>>>>> documents for the parent-child relationship to fit.
>>>>>
>>>>> I then tried the "top hits aggregation" and it's exactly what I'm
>>>>> looking for, except the running time of the query is approximately 30x
>>>>> slower than the query without the aggregation. Are there known performance
>>>>> issues with "top hits"? Any ideas on what I should use to make these
>>>>> queries? Here's the aggregation piece:
>>>>> "aggs": {
>>>>>
>>>>>     "top-fingerprints": {
>>>>>         "terms": {
>>>>>             "field": "fingerprint",
>>>>>             "size": 50
>>>>>         },
>>>>>         "aggs": {
>>>>>             "top_tag_hits": {
>>>>>                 "top_hits": {
>>>>>                     "size": 1,
>>>>>                     "_source": {
>>>>>                        "include": [
>>>>>                           "title"
>>>>>                        ]
>>>>>                     }
>>>>>                 }
>>>>>             }
>>>>>         }
>>>>>     }
>>>>> }
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Michael
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "elasticsearch" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/elasticsearch/29fce15c-79b7-4756-b033-93e490204095%40googlegroups.com.
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>>
>>
>>
>>
>> --
>> Kind regards,
>>
>> Martijn van Groningen
>>
>


-- 
Kind regards,

Martijn van Groningen

