Hi Mike

It looks like you're trying to get a list of the distinct item ids in a
result set, ordered by how frequently each item id occurs?

Can you use the collapsing query parser for this instead? It should be much quicker.

https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
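
Something like this ought to work (an untested sketch - assuming the
rest of your query otherwise stays the same):

q=*:*
&fq={!collapse field=item_id}
&facet=true
&facet.field=person_id

Collapse is applied as a post filter, so the person_id facet counts
would then be per item rather than per description.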

Every document with the same item_id would need to be on the same
shard for this to work, and I'm not sure whether you can actually get
the count of collapsed documents, if that's something you need.


Another option might be to use the HyperLogLog function - hll() -
instead of unique(), which should give slightly better performance.
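
That would just be a drop-in swap in your existing facet (hll gives
approximate counts, so the numbers may be slightly off):

&json.facet={
    "people": {
        "type": "terms",
        "field": "person_id",
        "facet": {
            "grouped_count": "hll(item_id)"
        },
        "sort": "grouped_count desc"
    }
}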

Cheers

Tom

On Thu, Feb 9, 2017 at 11:58 AM, Bryant, Michael
<michael.bry...@kcl.ac.uk> wrote:
> Hi all,
>
> I'm converting my legacy facets to JSON facets and am seeing much better 
> performance, especially with high cardinality facet fields. However, the one 
> issue I can't seem to resolve is excessive memory usage (and OOM errors) when 
> trying to simulate the effect of "group.facet" to sort facets according to a 
> grouping field.
>
> My situation, slightly simplified, is:
>
> Solr 4.6.1
>
>   *   Doc set: ~200,000 docs
>   *   Grouping by item_id, an indexed, stored, single value string field with 
> ~50,000 unique values, ~4 docs per item
>   *   Faceting by person_id, an indexed, stored, multi-value string field 
> with ~50,000 values (w/ a very skewed distribution)
>   *   No docValues fields
>
> Each document here is a description of an item, and there are several 
> descriptions per item in multiple languages.
>
> With legacy facets I use group.field=item_id and group.facet=true, which 
> gives me facet counts with the number of items rather than descriptions, and 
> correctly sorted by descending item count.
>
> With JSON facets I'm doing the equivalent like so:
>
> &json.facet={
>     "people": {
>         "type": "terms",
>         "field": "person_id",
>         "facet": {
>             "grouped_count": "unique(item_id)"
>         },
>         "sort": "grouped_count desc"
>     }
> }
>
> This works, and is somewhat faster than legacy faceting, but it also produces 
> a massive spike in memory usage when (and only when) the sort parameter is 
> set to the aggregate field. A server that runs happily with a 512MB heap OOMs 
> unless I give it a 4GB heap. With sort set to (the default) "count desc" 
> there is no memory usage spike.
>
> I would be curious if anyone has experienced this kind of memory usage when 
> sorting JSON facets by stats and if there’s anything I can do to mitigate it. 
> I’ve tried reindexing with docValues enabled on the relevant fields and it 
> seems to make no difference in this respect.
>
> Many thanks,
> ~Mike
