[jira] [Comment Edited] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness

Michael Gibney (JIRA) Mon, 28 Jan 2019 19:04:55 -0800


    [ 
https://issues.apache.org/jira/browse/SOLR-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754164#comment-16754164
 ]


Michael Gibney edited comment on SOLR-13132 at 1/29/19 3:03 AM:
----------------------------------------------------------------

I've refined the earlier patch (implementing parallel facet count collection 
for sort-by-relatedness). For consideration, the [^SOLR-13132-with-cache.patch] 
also implements a per-segment (and top-level) cache of facet counts (and inline 
"missing" bucket collection, fwiw).

As described [in a more discursive blog 
post|https://michaelgibney.net/2019/01/solr-terms-skg-performance/], the facet 
cache has been in the back of my mind for a while, and it has a particular 
impact on sort-by-relatedness with parallel facet count collection, so I 
adapted an initial implementation written for simple facets 
({{DocValuesFacets}}) to make it compatible with JSON facets as well.

For my use case this yields a 5x to 450x latency reduction for high- and even 
modestly-high-cardinality domain queries with sort-by-relatedness. The facet 
cache alone yields a ~10x latency reduction for simple sort-by-count facets 
over common/cached high-cardinality domains (e.g., {{*:*}}). More detail (rough 
benchmarks, etc.) can be found in the blog post linked above.

To enable facet cache, in {{solrconfig.xml}}:
{code:xml}
<cache name="termFacetCache"
       class="solr.search.LRUCache"
       size="200"
       initialSize="200"
       autowarmCount="200"
       regenerator="solr.request.TermFacetCacheRegenerator" />
{code}
(I realize the "facet cache" should probably be a separate issue, but given its 
particular relevance as a complement to this issue, I opted to include it in 
this patch. I hope that's ok ...)
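
To illustrate the general idea of a per-segment count cache with a top-level merge, here is a minimal Python sketch. It is not the patch's implementation: the class {{TermFacetCountCache}}, the function {{top_level_counts}}, the cache key layout, and keying counts by term rather than by ordinal are all assumptions made for illustration. Only the size of 200 mirrors the config above.

```python
from collections import Counter, OrderedDict


class TermFacetCountCache:
    """Tiny LRU cache of per-segment term counts (hypothetical, illustration only)."""

    def __init__(self, size=200):
        self.size = size
        self._entries = OrderedDict()

    def get_or_compute(self, segment_id, field, domain_key, compute):
        # Assumed key layout: one entry per (segment, field, domain).
        key = (segment_id, field, domain_key)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        counts = compute()
        self._entries[key] = counts
        if len(self._entries) > self.size:
            self._entries.popitem(last=False)  # evict least recently used
        return counts


def top_level_counts(cache, segment_ids, field, domain_key, count_segment):
    """Sum cached per-segment counts into one top-level result."""
    total = Counter()
    for segment_id in segment_ids:
        total.update(
            cache.get_or_compute(
                segment_id, field, domain_key,
                lambda s=segment_id: count_segment(s),
            )
        )
    return total
```

In this sketch a repeated request over the same domain only recomputes counts for segments that have no cached entry; everything else is a lookup plus a sum.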


was (Author: mgibney):
I've refined the earlier patch (implementing parallel facet count collection 
for sort-by-relatedness). For consideration, the [new 
patch|^SOLR-13132-with-cache.patch] also implements a per-segment (and 
top-level) cache of facet counts (and inline "missing" bucket collection, fwiw).

As described [in a more discursive blog 
post|https://michaelgibney.net/2019/01/solr-terms-skg-performance/], the facet 
cache is something that's been in the back of my mind for a while, but would 
have a particular impact on sort-by-relatedness with parallel facet count 
collection, so I modified an initial implementation from simple facets 
{{DocValuesFacets}} to make it compatible with JSON facets as well.

FYI, for my (real-world) test use case this yields anywhere from 5x-450x 
latency reduction for high- and even modestly-high-cardinality domain queries 
with sort-by-relatedness. Facet cache alone yields ~10x latency reduction for 
simple sort-by-count facets over common/cached high-cardinality domains (e.g., 
{{*:*}}). More detail (rough benchmarks, etc.) can be found in the blog post 
linked above.

To enable facet cache, in {{solrconfig.xml}}:
{code:xml}
<cache name="termFacetCache"
       class="solr.search.LRUCache"
       size="200"
       initialSize="200"
       autowarmCount="200"
       regenerator="solr.request.TermFacetCacheRegenerator" />
{code}
(I realize the "facet cache" should probably be a separate issue, but given its 
particular relevance as a complement to this issue, I opted to include it in 
this patch. I hope that's ok ...)

> Improve JSON "terms" facet performance when sorted by relatedness 
> ------------------------------------------------------------------
>
>                 Key: SOLR-13132
>                 URL: https://issues.apache.org/jira/browse/SOLR-13132
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Facet Module
>    Affects Versions: 7.4, master (9.0)
>            Reporter: Michael Gibney
>            Priority: Major
>         Attachments: SOLR-13132-with-cache.patch, SOLR-13132.patch
>
>
> When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate 
> {{relatedness}} for every term. 
> The current implementation uses a standard uninverted approach (either 
> {{docValues}} or {{UnInvertedField}}) to get facet counts over the domain 
> base docSet, and then uses that initial pass as a pre-filter for a 
> second-pass, inverted approach of fetching docSets for each relevant term 
> (i.e., {{count > minCount}}?) and calculating intersection size of those sets 
> with the domain base docSet.
> Over high-cardinality fields, the overhead of per-term docSet creation and 
> set intersection operations increases request latency to the point where 
> relatedness sort may not be usable in practice (for my use case, even after 
> applying the patch for SOLR-13108, for a field with ~220k unique terms per 
> core, QTimes for high-cardinality domain docSets were, e.g.: cardinality 
> 1816684=9000ms, cardinality 5032902=18000ms).
> The attached patch brings the above example QTimes down to a manageable 
> ~300ms and ~250ms respectively. The approach calculates uninverted facet 
> counts over domain base, foreground, and background docSets in parallel in a 
> single pass. This allows us to take advantage of the efficiencies built into 
> the standard uninverted {{FacetFieldProcessorByArray[DV|UIF]}}, and avoids 
> the per-term docSet creation and set intersection overhead.
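
The difference between the two approaches in the description above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Solr code: documents are a dict from doc id to a list of distinct terms, docSets are plain Python sets, and the function names are hypothetical. Both functions return, per term, the counts over the base, foreground, and background sets.

```python
from collections import defaultdict


def count_per_term_inverted(doc_terms, base, fg, bg):
    """Inverted approach: build a docSet per term, then intersect it with each set."""
    postings = defaultdict(set)
    for doc_id, terms in doc_terms.items():
        for term in terms:
            postings[term].add(doc_id)
    counts = {}
    for term, docs in postings.items():
        base_count = len(docs & base)
        if base_count > 0:  # only terms that occur in the domain base
            counts[term] = (base_count, len(docs & fg), len(docs & bg))
    return counts


def count_single_pass(doc_terms, base, fg, bg):
    """Uninverted approach: one walk over the docs, three counters per term."""
    counts = defaultdict(lambda: [0, 0, 0])
    for doc_id, terms in doc_terms.items():
        in_base, in_fg, in_bg = doc_id in base, doc_id in fg, doc_id in bg
        if not (in_base or in_fg or in_bg):
            continue  # doc belongs to none of the three sets
        for term in terms:
            slot = counts[term]
            slot[0] += in_base
            slot[1] += in_fg
            slot[2] += in_bg
    return {t: tuple(c) for t, c in counts.items() if c[0] > 0}
```

The single-pass version does a constant amount of work per (doc, term) pair and never materializes a per-term set, which is the overhead the description attributes to the second-pass approach.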



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
