[
https://issues.apache.org/jira/browse/SOLR-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754164#comment-16754164
]
Michael Gibney edited comment on SOLR-13132 at 1/29/19 3:03 AM:
----------------------------------------------------------------
I've refined the earlier patch (implementing parallel facet count collection
for sort-by-relatedness). For consideration, the [^SOLR-13132-with-cache.patch]
also implements a per-segment (and top-level) cache of facet counts (and inline
"missing" bucket collection, fwiw).
As described [in a more discursive blog
post|https://michaelgibney.net/2019/01/solr-terms-skg-performance/], the facet
cache has been in the back of my mind for a while, but it has a particular
impact on sort-by-relatedness with parallel facet count collection, so I
adapted an initial implementation written for simple facets
({{DocValuesFacets}}) to make it compatible with JSON facets as well.
For my use case this yields a latency reduction of anywhere from 5x to 450x
for high- and even modestly-high-cardinality domain queries with
sort-by-relatedness. The facet cache alone yields a ~10x latency reduction for
simple sort-by-count facets over common/cached high-cardinality domains (e.g.,
{{*:*}}). More detail (rough benchmarks, etc.) can be found in the blog post
linked above.
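For reference, the kind of request discussed here (a JSON "terms" facet sorted by {{relatedness}} over a high-cardinality domain) can be sketched roughly as follows. This is only an illustration: the field name {{subject_s}}, the foreground query {{format_s:book}}, and the facet label {{by_term}} are made-up placeholders, not names from the patch or from my index.

```python
import json


def relatedness_terms_request(field, fore, back="*:*", limit=10):
    """Build a JSON Request API body for a terms facet sorted by relatedness.

    `field`, `fore` and `back` are supplied by the caller; the facet label
    "by_term" and the stat name "r1" are arbitrary illustrative names.
    """
    return {
        "query": "*:*",  # high-cardinality domain: every document
        "limit": 0,      # no documents returned, facets only
        "params": {"fore": fore, "back": back},
        "facet": {
            "by_term": {
                "type": "terms",
                "field": field,
                "limit": limit,
                # sort buckets by the relatedness stat defined below
                "sort": {"r1": "desc"},
                "facet": {"r1": "relatedness($fore,$back)"},
            }
        },
    }


# Hypothetical field and foreground query, for illustration only.
body = relatedness_terms_request("subject_s", fore="format_s:book")
print(json.dumps(body, indent=2))
```

The printed body would be POSTed to a collection's {{/query}} handler; the parallel count collection and the facet cache are server-side changes and need no change to the request itself.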
To enable the facet cache, in {{solrconfig.xml}}:
{code:xml}
<cache name="termFacetCache"
class="solr.search.LRUCache"
size="200"
initialSize="200"
autowarmCount="200"
regenerator="solr.request.TermFacetCacheRegenerator" />
{code}
(I realize the "facet cache" should probably be a separate issue, but given its
particular relevance as a complement to this issue, I opted to include it in
this patch. I hope that's ok ...)
> Improve JSON "terms" facet performance when sorted by relatedness
> ------------------------------------------------------------------
>
> Key: SOLR-13132
> URL: https://issues.apache.org/jira/browse/SOLR-13132
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Components: Facet Module
> Affects Versions: 7.4, master (9.0)
> Reporter: Michael Gibney
> Priority: Major
> Attachments: SOLR-13132-with-cache.patch, SOLR-13132.patch
>
>
> When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate
> {{relatedness}} for every term.
> The current implementation uses a standard uninverted approach (either
> {{docValues}} or {{UnInvertedField}}) to get facet counts over the domain
> base docSet, and then uses that initial pass as a pre-filter for a
> second-pass, inverted approach of fetching docSets for each relevant term
> (i.e., {{count > minCount}}?) and calculating intersection size of those sets
> with the domain base docSet.
> Over high-cardinality fields, the overhead of per-term docSet creation and
> set intersection operations increases request latency to the point where
> relatedness sort may not be usable in practice (for my use case, even after
> applying the patch for SOLR-13108, for a field with ~220k unique terms per
> core, QTimes for high-cardinality domain docSets were, e.g., 9000ms at
> cardinality 1816684 and 18000ms at cardinality 5032902).
> The attached patch brings the above example QTimes down to a manageable
> ~300ms and ~250ms respectively. The approach calculates uninverted facet
> counts over domain base, foreground, and background docSets in parallel in a
> single pass. This allows us to take advantage of the efficiencies built into
> the standard uninverted {{FacetFieldProcessorByArray[DV|UIF]}}, and avoids
> the per-term docSet creation and set intersection overhead.
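
To make the difference between the two approaches in the description concrete, here is a toy sketch in Python. It is not the patch and not Solr code: docSets are modeled as plain sets of doc ids, the uninverted structure as a per-document collection of term ordinals, and the names {{two_pass_counts}}, {{single_pass_counts}}, and {{postings}} are invented for the example. The {{min_count}} threshold uses {{>=}}, which is an assumption.

```python
from collections import Counter


def two_pass_counts(term_ords, postings, base, fore, back, min_count=1):
    """Uninverted pass over base, then per-term docSet intersections."""
    base_counts = Counter()
    for doc in base:
        base_counts.update(term_ords[doc])
    result = {}
    for term, n in base_counts.items():
        if n >= min_count:
            term_docs = postings[term]  # per-term docSet (the costly part)
            result[term] = (n, len(term_docs & fore), len(term_docs & back))
    return result


def single_pass_counts(term_ords, base, fore, back, min_count=1):
    """Collect base, foreground and background counts in one uninverted pass."""
    base_c, fore_c, back_c = Counter(), Counter(), Counter()
    for doc in base | fore | back:
        ords = term_ords[doc]
        if doc in base:
            base_c.update(ords)
        if doc in fore:
            fore_c.update(ords)
        if doc in back:
            back_c.update(ords)
    return {t: (n, fore_c[t], back_c[t])
            for t, n in base_c.items() if n >= min_count}


# Toy index: term ordinals per document (doc id = list position).
term_ords = [{0, 1}, {1}, {0, 2}, {2}, {1, 2}]
# Inverted view of the same data, needed only by the two-pass variant.
postings = {}
for doc_id, ords in enumerate(term_ords):
    for t in ords:
        postings.setdefault(t, set()).add(doc_id)

base, fore, back = {0, 1, 2}, {0, 2, 4}, {0, 1, 2, 3, 4}
counts = single_pass_counts(term_ords, base, fore, back)
print(counts)
```

Both functions return the same (base, foreground, background) counts per term; the single-pass variant never materializes a per-term docSet, which is the overhead the description attributes to the current implementation.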